Abstract

The PAXOS algorithm is an efficient and highly fault-tolerant algorithm, devised by 
Lamport, for reaching consensus in a distributed system. Although it appears to be
practical, it seems not to be widely known or understood. This thesis contains a new
presentation of the PAXOS algorithm, based on a formal decomposition into several 
interacting components. It also contains a correctness proof and a time performance 
and fault-tolerance analysis. 

The presentation is built upon a general timed automaton (GTA) model. The 
correctness proof uses automaton composition and invariant assertion methods. The 
time performance and fault-tolerance analysis is conditional on the stabilization of 
the underlying physical system behavior starting from some point in an execution. 
In order to formalize this stabilization, a special type of GTA called a Clock GTA is 
defined. 
Chapter 1



Introduction 



Reaching consensus is a fundamental problem in distributed systems. Given a
distributed system in which each process¹ starts with an initial value, to solve a consensus
problem means to give a distributed algorithm that enables each process to eventu- 
ally output a value of the same type as the input values, in such a way that three 
conditions, called agreement, validity and termination, hold. There are different def- 
initions of the problem depending on what these conditions require. The agreement 
condition states requirements about the way processes need to agree (e.g., "no two 
different outputs occur"). The validity condition states requirements about the rela- 
tion between the input and the output values (e.g., "any output value must belong to 
the set of initial values"). The termination condition states requirements about the 
termination of an algorithm that solves the problem (e.g., "each non-faulty process 
eventually outputs a value"). Distributed consensus has been extensively studied; a 
good survey of early results is provided in [17]. We refer the reader to [35] for a more 
up-to-date treatment of consensus problems. 



¹ We remark that the words "process" and "processor" are often used as synonyms. The word
"processor" is more appropriate when referring to a physical component of a distributed system. 
A physical processor is often viewed as consisting of several logical components, called "processes". 
Processes are composed to describe larger logical components, and the resulting composition is also 
called a process. Thus the whole physical processor can be identified with the composition of all its 
logical components. Whence the word "process" can also be used to indicate the physical processor. 
In this thesis we use the word "process" to mean either a physical processor or a logical component 
of it. The distinction either is unimportant or should be clear from the context. 



Consensus problems arise in many practical situations, such as distributed data
replication, distributed databases, and flight control systems. Data replication
is used in practice to provide high availability: having more than one copy
of the data allows easier access to the data, i.e., the nearest copy of the data can 
be used. However, consistency among the copies must be maintained. A consensus 
algorithm can be used to maintain consistency. A practical example of the use of data 
replication is an airline reservation system. The data consists of the current booking 
information for the flights and it can be replicated at agencies spread over the world. 
The current booking information can be accessed at any of the replicas. Reservations 
or cancellations must be agreed upon by all the copies. 

In a distributed database, the consensus problem arises when a collection of pro- 
cesses participating in the processing of a distributed transaction has to agree on 
whether to commit or abort the transaction, that is, make the changes due to the 
transaction permanent or discard the changes. A common decision must be taken to 
avoid inconsistencies. A practical example of the use of distributed transactions is a 
banking system. Transactions can be done at any bank location or ATM machine, 
and the commitment or abortion of each transaction must be agreed upon by all the 
bank locations or ATM machines involved. 

In a flight control system, the consensus problem arises when the flight surface 
and airplane control systems have to agree on whether to continue or abort a landing 
in progress or when the control systems of two approaching airplanes need to modify 
the air routes to avoid collision. 

Various theoretical models of distributed systems have been considered. A gen- 
eral classification of models is based on the kind of communications allowed between 
processes of the distributed system. There are two ways by which processes commu- 
nicate: by passing messages over communication channels or using a shared memory. 
In this thesis we focus on message-passing models. 

A wide variety of message-passing models can be used to represent distributed 
systems. They can be classified by the network topology, the synchrony of the system 
and the failures allowed. The network topology describes which processes can send 
messages directly to which other processes and it is usually represented by a graph in
which nodes represent processes and edges represent direct communication channels. 
Often one assumes that a process knows the entire network; sometimes one assumes 
that a process has only a local knowledge of the network (e.g., each process knows 
only the processes for which it has a direct communication channel). 

About synchrony, several model variations, ranging from the completely asyn- 
chronous setting to the completely synchronous one, can be considered. A completely 
asynchronous model is one with no concept of real time. It is assumed that messages 
are eventually delivered and processes eventually respond, but it may take arbitrarily 
long. In partially synchronous systems some timing assumptions are made. For ex- 
ample, upper bounds on the time needed for a process to respond and for a message 
to be delivered hold. These upper bounds are known by the processes and processes 
have some form of real-time clock to take advantage of the time bounds. In com- 
pletely synchronous systems, the computation proceeds in rounds in which steps are 
taken by all the processes. 

Failures may concern both communication channels and processes. In partially 
synchronous models, messages are supposed to be delivered and processes are ex- 
pected to act within some time bounds; a timing failure is a violation of these time 
bounds. Communication failures can result in loss of messages. Duplication and re- 
ordering of messages may be considered failures, too. The weakest assumption made 
about process failures is that a faulty process has an unrestricted behavior. Such a 
failure is called a Byzantine failure. More restrictive models permit only omission 
failures, in which a faulty process fails to send some messages. The most restrictive 
models allow only stopping failures, in which a failed process simply stops and takes 
no further actions. Some models assume that failed processes can be restarted. Often 
processes have some form of stable storage that is not affected by a stopping failure; 
a stopped process is restarted with its stable storage in the same state as before the 
failure and with every other part of its state restored to some initial values. 

In the absence of failures, distributed consensus problems are easy to solve: it is 
enough to exchange information about the initial values of the processes and use a 
common decision rule for the output in order to satisfy both agreement and validity.
Failures complicate the matter, so that distributed consensus can even be impossible 
to achieve. The difficulties depend upon the distributed system model considered 
and the exact definition of the problem (i.e., the agreement, validity and termination 
conditions). 

Real distributed systems are often partially synchronous systems subject to pro- 
cess, channel and timing failures and process recoveries. Today's distributed systems 
occupy larger and larger physical areas; the larger the physical area spanned by the 
distributed system, the harder it is to provide synchrony. Physical components are 
subject to failures. When a failure occurs, it is likely that, some time later, the prob- 
lem is fixed, restoring the failed component to normal operation. Moreover, though 
timely responses can usually be provided in real distributed systems, the possibility 
of process and channel failures makes it impossible to guarantee that timing assump- 
tions are always satisfied. Thus real distributed systems suffer timing failures. Any 
practical consensus algorithm needs to consider all the above practical issues. More- 
over, the basic safety properties must not be affected by the occurrence of failures. 
Also, the performance of the algorithm should be good when there are no failures. 

PAXOS is an algorithm devised by Lamport [29] that solves the consensus prob- 
lem. The model considered is a partially synchronous distributed system where each 
process has a direct communication channel with each other process. The failures 
allowed are timing failures, loss, duplication and reordering of messages, and process
stopping failures. Process recoveries are considered; some stable storage is needed. PAXOS
is guaranteed to work safely, that is, to satisfy agreement and validity, regardless of 
process, channel and timing failures and process recoveries. When the distributed 
system stabilizes, meaning that there are neither failures nor process recoveries and a
majority of the processes are not stopped for a sufficiently long time, termination is
achieved; the performance of the algorithm when the system stabilizes is good.
A variation of PAXOS that considers multiple concurrent runs of PAXOS, for when
consensus has to be reached on a sequence of values, is also presented in [29]. We call
this variation the MULTIPAXOS algorithm².

The basic idea of the PAXOS algorithm is to propose values until one of them is 
accepted by a majority of the processes; that value is the final output value. Any 
process may propose a value by initiating a round for that value. The process initiat- 
ing a round is the leader of that round. Rounds are guaranteed to satisfy agreement 
and validity. A successful round, that is, a round in which a value is accepted by 
a majority of the processes, results in the termination of the algorithm. However 
a successful round is guaranteed to be conducted only when the distributed system 
stabilizes. Basically PAXOS keeps starting rounds while the system is not stable, but 
when the system stabilizes, a successful round is conducted. Though failures may 
force the algorithm to always start new rounds, a single round is not costly: it uses
a number of messages and an amount of time that are only linear in the number of processes.
Thus, PAXOS has good fault-tolerance properties and when the system is stable com- 
bines those fault-tolerance properties with the performance of an efficient algorithm, 
so that it can be useful in practice. 
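
As a rough illustration of the majority rule described above, the following Python sketch checks whether some value has been accepted by a majority of the processes. It is not the automaton code of this thesis: the names `is_majority` and `decided_value` are hypothetical, and round numbers, leaders and message exchange are deliberately ignored.

```python
from collections import Counter
from typing import Dict, Hashable, Optional


def is_majority(count: int, n: int) -> bool:
    """True when `count` processes form a strict majority of `n` processes."""
    return 2 * count > n


def decided_value(accepted: Dict[str, Hashable], n: int) -> Optional[Hashable]:
    """Return the value accepted by a majority of the n processes, if any.

    `accepted` maps a (hypothetical) process name to the value it has
    accepted; processes that have accepted nothing are simply absent.
    """
    for value, count in Counter(accepted.values()).items():
        if is_majority(count, n):
            return value
    return None
```

With five processes, three acceptances of the same value are enough for `decided_value` to return it, while two are not.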

In the original paper [29], the PAXOS algorithm is described as the result of discov- 
eries of archaeological studies of an ancient Greek civilization. That paper contains 
a sketch of a proof of correctness and a discussion of the performance analysis. The 
style used for the description of the algorithm often diverts the reader's attention. 
Because of this, we found the paper hard to understand and we suspect that others 
did as well. Indeed the PAXOS algorithm, even though it appears to be a practical 
and elegant algorithm, seems not widely known or understood, either by distributed 
systems researchers or distributed computing theory researchers. 

This thesis contains a new, detailed presentation of the PAXOS algorithm, based 
on a formal decomposition into several interacting components. It also contains a cor- 
rectness proof and a time performance and fault-tolerance analysis. The MULTIPAXOS 
algorithm is also described together with an application to data replication. 



² PAXOS is the name of the ancient civilization studied in [29]. The actual algorithm is called the
"single-decree synod" protocol and its variation for multiple consensus is called the "multi-decree 
parliament" protocol. We take the liberty of using the name PAXOS for the single-decree synod 
protocol and the name MULTIPAXOS for the multi-decree parliament protocol. 
The formal framework used for the presentation is provided by Input/Output
automata models. Input/Output automata are simple state machines with transitions 
labelled with actions. They are suitable for describing asynchronous and partially 
synchronous distributed systems. The basic I/O automaton model, introduced by 
Lynch and Tuttle [37], is suitable for modelling asynchronous distributed systems. 
For our purposes, we will use the general timed automaton (GTA) model, introduced 
by Lynch and Vaandrager [38, 39, 40], which has formal mechanisms to represent
the passage of time and is suitable for modelling partially synchronous distributed 
systems. 

The correctness proof uses automaton composition and invariant assertion meth- 
ods. Composition is useful for representing a system using separate components. 
This split representation is helpful in carrying out the proofs. We provide a modular 
presentation of the PAXOS algorithm, obtained by decomposing it into several com- 
ponents. Each one of these components copes with a specific aspect of the problem. 
In particular there is a "failure detector" module that detects process failures and 
recoveries. There is a "leader elector" module that copes with the problem of electing 
a leader; processes elected leader by this module start new rounds for the PAXOS
algorithm. The PAXOS algorithm is then split into a basic part that ensures agreement
and validity and into an additional part that ensures termination when the system 
stabilizes; the basic part of the algorithm, for the sake of clarity of presentation, is 
further subdivided into three components. The correctness of each piece is proved 
by means of invariants, i.e., properties of system states that are always true in an 
execution. The key invariants we use in our proof are the same as in [31, 32]. 

The time performance and fault-tolerance analysis is conditional on the stabiliza- 
tion of the system behavior starting from some point in an execution. While it is 
easy to formalize process and channel failures, dealing formally with timing failures 
is harder. To cope with this problem, this thesis introduces a special type of GTA 
called a Clock GTA. The Clock GTA is a GTA augmented with a simple way of for- 
malizing timing failures. Using the Clock GTA we provide a technique for practical 
time performance analysis based on the stabilization of the physical system. 
A detailed description of the MULTIPAXOS protocol is also provided. As an example
of an application, the use of MULTIPAXOS to implement a data replication algorithm is 
presented. With MULTIPAXOS the high availability of the replicated data is combined 
with high fault tolerance. This is not trivial, since having replicated copies implies 
that consistency has to be guaranteed and this may result in low fault tolerance. 

Independent work related to PAXOS has been carried out. The algorithms in [11, 
34] have similar ideas. The algorithm of Dwork, Lynch and Stockmeyer [11] also uses 
rounds conducted by a leader, but the rounds are conducted sequentially, whereas 
in PAXOS a leader can start a round at any time and multiple simultaneous leaders 
are allowed. The strategy used in each round by the algorithm of [11] is somewhat 
different from the one used by PAXOS. Moreover the distributed model of [11] does 
not consider process recoveries. The time analysis provided in [11] is conditional on 
a "global stabilization time" after which process response times and message delivery 
times satisfy the time assumptions. This is similar to our analysis. (A similar time 
analysis, applied to a different problem, can be found in [16].) 

MULTIPAXOS can be easily used to implement a data replication algorithm. In 
[34] a data replication algorithm is provided. It incorporates ideas similar to the ones 
used in PAXOS. 

PAXOS bears some similarities with the standard three-phase commit protocol:
both require, in each round, an exchange of 5 messages. However, the standard commit
protocol requires a reliable leader elector while PAXOS does not. Moreover, PAXOS
sends information on the value to agree on only in the third message of a round,
while the commit protocol sends it in the first message; because of this, MULTIPAXOS
can exchange the first two messages only once for many instances and use only the
exchange of the last three messages for each individual consensus problem, while such
a strategy cannot be used with the three-phase commit protocol.
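
The saving can be made concrete with a little arithmetic. The sketch below counts message steps for k consensus instances under the simplifying assumptions that every instance needs exactly one round and that the same leader conducts all of them; the function names are hypothetical.

```python
def steps_separate_rounds(k: int) -> int:
    """Message steps when each of k instances runs all five steps of a round."""
    return 5 * k


def steps_shared_prefix(k: int) -> int:
    """Message steps when the first two steps are exchanged once for all
    k instances and only the last three are repeated per instance."""
    return 2 + 3 * k if k > 0 else 0
```

For example, ten instances take 50 steps when run separately and 32 steps when the first two messages are shared.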

In the class notes of the graduate level Principles of Computer Systems course [31] 
taught at MIT, a description of PAXOS is provided using a specification language called 
SPEC. The presentation in [31] contains the description of how a round of PAXOS is 
conducted. The leader election problem is not considered. Timing issues are not 
considered; for example, the problem of starting new rounds is not addressed. A
proof of correctness, written also in SPEC, is outlined. Our presentation differs from 
that of [31] in the following aspects: it is based on I/O automata models rather than 
on a programming language; it provides all the details of the algorithm; it provides 
a modular description of the algorithm, including auxiliary parts such as a failure 
detector module and a leader elector module; along with the proof of correctness, 
it provides a performance and fault-tolerance analysis. In [32] Lampson provides 
an overview of the PAXOS algorithm together with the key points for proving the 
correctness of the algorithm. 

In [43] the clock synchronization problem has been studied; the solution provided
there introduces a new type of GTA, called the mixed automaton model. The mixed
automaton is similar to our Clock GTA in that both try
to formally handle the local clocks of processes. However, while the mixed automaton
model is used to obtain synchronization of the local clocks, the Clock GTA
is used to model good timing behavior and thus does not need to cope with
synchronization.

Summary of contributions. This thesis provides a new, detailed and modular 
description of the PAXOS algorithm, a correctness proof and a time performance anal- 
ysis. The MULTIPAXOS algorithm is described and an application to data replication 
is provided. This thesis also introduces a special type of GTA model, called the Clock 
GTA model, and a technique for practical time performance analysis when the system 
stabilizes. 

Organization. This thesis is organized as follows. In Chapter 2 we provide a 
description of the I/O automata models and in particular we introduce the Clock 
GTA model. In Chapter 3 we discuss the distributed setting we consider. Chapter 4 
gives a formal definition of the consensus problem we consider. Chapter 5 is devoted 
to the design of a simple failure detector and a simple leader elector which will be 
used to give an implementation of PAXOS. Then in Chapter 6 we describe the PAXOS 
algorithm, prove its correctness and analyze its performance. In Chapter 7 we describe
the MULTIPAXOS algorithm. Finally in Chapter 8 we discuss how to use MULTIPAXOS 
to implement a data replication algorithm. Chapter 9 contains the conclusions. 
Chapter 2



Models 



In this chapter we describe the I/O automata models we use in this thesis. Section 2.1 
presents an overview of the automata models. Then, Section 2.2 describes the basic 
I/O automaton model, which is used in Section 2.3 to describe the MMT automaton 
model. Section 2.4 describes the general timed automaton model. In Section 2.5 the 
Clock GT automaton is introduced; Section 2.5 also provides a technique to transform
an MMTA into a Clock GTA. Section 2.6 describes how automata are composed. 

2.1 Overview 

The I/O automata models are formal models suitable for describing asynchronous and 
partially synchronous distributed systems. Various I/O automata models have been 
developed so far (see, for example, [35]). The simplest I/O automata model does not 
consider time and thus it is suitable for describing asynchronous systems. We remark 
that in the literature this simple I/O automata model is referred to as the "I/O 
automaton model". However, we prefer to use the general expression "I/O automata
models" to indicate all the I/O automata models; henceforth we refer to the simplest
one as the "basic I/O automaton model" (BIOA for short). Two extensions of the 
BIOA model that provide formal mechanisms to handle the passage of time and thus 
are suitable for describing partially synchronous distributed systems, are the MMT 
automaton (MMTA for short) and the general timed automaton (GT automaton or 
GTA for short). The MMTA is a special case of GTA, and thus it can be regarded
as a notation for describing some GT automata. 

In this thesis we introduce a particular type of GTA that we call "Clock GTA". The
Clock GTA is suitable for describing partially synchronous distributed systems with 
processors having local clocks; thus it is suitable for describing timing assumptions. In 
this thesis we use the GTA model and in particular the Clock GT automaton model. 
However, we use the MMT automaton model to describe some of the Clock GTAs¹;
this is possible because an MMTA is a particular type of GTA; there is a standard 
technique that transforms an MMTA into a GTA and we specialize this technique in 
order to transform an MMT automaton into a Clock GTA. 

An I/O automaton is a simple type of state machine in which transitions are 
associated with named actions. These actions are classified into categories, namely 
input, output, internal and, for the timed models, time-passage. Input and output 
actions are used for communication with the external environment, while internal 
actions are local to the automaton. The time-passage actions are intended to model 
the passage of time. The input actions are assumed not to be under the control 
of the automaton, that is, they are controlled by the external environment which 
can force the automaton to execute the input actions. Internal and output actions 
are controlled by the automaton. The time-passage actions are also controlled by 
the automaton (though this may at first seem somewhat strange, it is just a formal
way of modeling the fact that the automaton must perform some action before some 
amount of time elapses). 

As an example, we can consider an I/O automaton that models the behavior of 
a process involved in a consensus problem. Figure 2-1 shows the typical interface 
(that is, input and output actions) of such an automaton. The automaton is drawn 
as a circle, input actions are depicted as incoming arrows and output actions as 
outgoing arrows (internal actions are hidden since they are local to the automaton).



¹ The reason why we use MMT automata to describe some of our Clock GT automata is that
MMT automata code is simpler. We use MMTA to describe the parts of the algorithm that can run 
asynchronously and we use the time bounds only for the analysis. 
Figure 2-1: An I/O automaton. (Diagram not reproduced; it shows the actions Init(v), Decide(v), Send(m) and Receive(m).)

The automaton receives inputs from the external world by means of action Init(v),
which represents the receipt of an input value v, and conveys outputs by means of
action Decide(v), which represents a decision of v. Actions Send(m) and Receive(m)
are supposed to model the communication with other automata.

2.2 The basic I/O automaton model

A signature S is a triple consisting of three disjoint sets of actions: the input
actions, in(S), the output actions, out(S), and the internal actions, int(S). The
external actions, ext(S), are $in(S) \cup out(S)$; the locally controlled actions, local(S), are
$out(S) \cup int(S)$; and acts(S) consists of all the actions of S. The external signature,
extsig(S), is defined to be the signature $(in(S), out(S), \emptyset)$. The external signature is
also referred to as the external interface.
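
The derived sets of a signature can be sketched in a few lines of Python. This is only an illustration under the assumption that actions are represented by their names; the class `Signature` and its field names are hypothetical and not part of the formal model.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Signature:
    """A signature S: three pairwise disjoint sets of action names."""
    inputs: FrozenSet[str]      # in(S)
    outputs: FrozenSet[str]     # out(S)
    internals: FrozenSet[str]   # int(S)

    def __post_init__(self) -> None:
        if (self.inputs & self.outputs or self.inputs & self.internals
                or self.outputs & self.internals):
            raise ValueError("in(S), out(S) and int(S) must be disjoint")

    def ext(self) -> FrozenSet[str]:
        """External actions: in(S) united with out(S)."""
        return self.inputs | self.outputs

    def local(self) -> FrozenSet[str]:
        """Locally controlled actions: out(S) united with int(S)."""
        return self.outputs | self.internals

    def acts(self) -> FrozenSet[str]:
        """All the actions of S."""
        return self.inputs | self.outputs | self.internals

    def extsig(self) -> "Signature":
        """External signature: same inputs and outputs, no internal actions."""
        return Signature(self.inputs, self.outputs, frozenset())
```

For a consensus process one might take Init and Receive as inputs and Decide and Send as outputs, plus whatever internal actions the algorithm needs.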

A basic I/O automaton (BIOA for short) A consists of five components:

• sig(A), a signature

• states(A), a (not necessarily finite) set of states

• start(A), a nonempty subset of states(A) known as the start states or initial
states

• trans(A), a state-transition relation, where
$trans(A) \subseteq states(A) \times acts(sig(A)) \times states(A)$; this must have the property that for every state s
and every input action $\pi$, there is a transition $(s, \pi, s') \in trans(A)$

• tasks(A), a task partition, which is an equivalence relation on local(sig(A))
having at most countably many equivalence classes

Often acts(A) is used as shorthand for acts(sig(A)), and similarly in(A), and so on.

An element $(s, \pi, s')$ of trans(A) is called a transition, or step, of A. If for a
particular state s and action $\pi$, A has some transition of the form $(s, \pi, s')$, then we
say that $\pi$ is enabled in s. Input actions are enabled in every state.

The fifth component of the I/O automaton definition, the task partition tasks(A),
should be thought of as an abstract description of "tasks," or "threads of control," 
within the automaton. This partition is used to define fairness conditions on an 
execution of the automaton; roughly speaking, the fairness conditions say that the 
automaton must continue, during its execution, to give fair turns to each of its tasks. 

An execution fragment of A is either a finite sequence, $s_0, \pi_1, s_1, \pi_2, \ldots, \pi_r, s_r$,
or an infinite sequence, $s_0, \pi_1, s_1, \pi_2, \ldots, \pi_r, s_r, \ldots$, of alternating states and actions
of A such that $(s_k, \pi_{k+1}, s_{k+1})$ is a transition of A for every $k \ge 0$. Note that if
the sequence is finite, it must end with a state. An execution fragment beginning
with a start state is called an execution. The length of a finite execution fragment
$\alpha = s_0, \pi_1, s_1, \pi_2, \ldots, \pi_r, s_r$ is r. The set of executions of A is denoted by execs(A).
A state is said to be reachable in A if it is the final state of a finite execution of A.

The trace of an execution $\alpha$ of A, denoted by $trace(\alpha)$, is the subsequence of $\alpha$
consisting of all the external actions. A trace $\beta$ of A is the trace of an execution of
A. The set of traces of A is denoted by traces(A).
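
The definitions of execution fragment, execution and trace can be checked mechanically on finite sequences. The Python sketch below assumes that a finite sequence is stored as a list alternating states and action names; the helper names are hypothetical. A toy relation used to try it out need not be input-enabled, which a real BIOA would have to be.

```python
from typing import Hashable, List, Set, Tuple

State = Hashable
Action = str
Transition = Tuple[State, Action, State]


def is_execution_fragment(trans: Set[Transition], seq: List) -> bool:
    """Check that seq = [s0, pi1, s1, ..., pir, sr] ends with a state and
    that every (s_k, pi_{k+1}, s_{k+1}) is a transition."""
    if len(seq) % 2 == 0:  # empty, or ends with an action
        return False
    return all((seq[i], seq[i + 1], seq[i + 2]) in trans
               for i in range(0, len(seq) - 2, 2))


def is_execution(trans: Set[Transition], start: Set[State], seq: List) -> bool:
    """An execution is an execution fragment beginning with a start state."""
    return is_execution_fragment(trans, seq) and seq[0] in start


def fragment_length(seq: List) -> int:
    """The length r of a finite fragment is its number of actions."""
    return len(seq) // 2


def trace(seq: List, external: Set[Action]) -> List[Action]:
    """Subsequence of seq consisting of all the external actions."""
    return [a for a in seq[1::2] if a in external]
```

For a fragment that performs Init(v), then an internal action, then Decide(v), the trace keeps only Init(v) and Decide(v), and the length is 3.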

2.3 The MMT automaton model

The MMT automaton model is obtained simply by adding to the BIOA model
lower and upper bounds on the time that can elapse before an enabled action is
executed. Formally, an MMT automaton consists of a BIOA and a boundmap b. A
boundmap b is a pair of mappings, lower and upper, which give lower and upper bounds
for all the tasks. For each task C, it is required that $0 \le lower(C) \le upper(C) \le \infty$
and that $lower(C) < \infty$. The bounds lower(C) and upper(C) are, respectively, a
lower bound and an upper bound on the time that can elapse before an enabled
action belonging to C is executed.

A timed execution of an MMT automaton B = (A, b) is defined to be a finite
sequence $\alpha = s_0, (\pi_1, t_1), s_1, (\pi_2, t_2), \ldots, (\pi_r, t_r), s_r$ or an infinite sequence
$\alpha = s_0, (\pi_1, t_1), s_1, (\pi_2, t_2), \ldots, (\pi_r, t_r), s_r, \ldots$, where the s's are states of the I/O automaton
A, the $\pi$'s are actions of A, and the t's are times in $\mathbb{R}^{\ge 0}$. It is required that the
sequence $s_0, \pi_1, s_1, \ldots$ (that is, the sequence $\alpha$ with the times ignored) be an
ordinary execution of I/O automaton A. It is also required that the successive times $t_r$
in $\alpha$ be nondecreasing and that they satisfy the lower and upper bound requirements
expressed by the boundmap b.

Define r to be an initial index for a task C provided that C is enabled in $s_r$ and
one of the following is true: (i) $r = 0$; (ii) C is not enabled in $s_{r-1}$; (iii) $\pi_r \in C$. The
initial indices represent the points at which we begin to measure the time bounds of
the boundmap. For every initial index r for a task C, it is required that the following
conditions hold. (Let $t_0 = 0$.)

Upper bound condition: If there exists $k > r$ with $t_k > t_r + upper(C)$, then there
exists $k' > r$ with $t_{k'} \le t_r + upper(C)$ such that either $\pi_{k'} \in C$ or C is not enabled in
$s_{k'}$.

Lower bound condition: There does not exist $k > r$ with $t_k < t_r + lower(C)$ and
$\pi_k \in C$.

The upper bound condition says that, from any initial index for a task C, if time ever 
passes beyond the specified upper bound for C, then in the interim, either an action 
in C must occur, or else C must become disabled. The lower bound condition says 
that, from any initial index for C, no action in C can occur before the specified lower 
bound. 
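
The two conditions can be checked directly on a finite timed execution. The Python sketch below is an illustration only, under these assumptions: states, actions and times are given as parallel lists indexed from 0 to r, with `actions[0]` unused and `times[0]` equal to 0; the task C is a set of action names; and `task_enabled` is a hypothetical predicate telling whether C is enabled in a state. An infinite upper bound can be passed as `math.inf`.

```python
import math
from typing import Callable, Hashable, List, Optional, Set


def initial_indices(states: List[Hashable], actions: List[Optional[str]],
                    task: Set[str],
                    task_enabled: Callable[[Hashable], bool]) -> List[int]:
    """Indices r with C enabled in s_r and (i) r = 0, or (ii) C not enabled
    in s_{r-1}, or (iii) pi_r in C."""
    result = []
    for r, s in enumerate(states):
        if not task_enabled(s):
            continue
        if r == 0 or not task_enabled(states[r - 1]) or actions[r] in task:
            result.append(r)
    return result


def satisfies_bounds(states: List[Hashable], actions: List[Optional[str]],
                     times: List[float], task: Set[str],
                     task_enabled: Callable[[Hashable], bool],
                     lower: float, upper: float) -> bool:
    """Check the lower and upper bound conditions for task C on a finite
    timed execution."""
    n = len(states)
    for r in initial_indices(states, actions, task, task_enabled):
        later = range(r + 1, n)
        # Lower bound: no action of C strictly before t_r + lower(C).
        if any(times[k] < times[r] + lower and actions[k] in task
               for k in later):
            return False
        # Upper bound: if time passes beyond t_r + upper(C), then in the
        # interim an action of C occurred or C became disabled.
        if any(times[k] > times[r] + upper for k in later):
            if not any(times[k] <= times[r] + upper
                       and (actions[k] in task
                            or not task_enabled(states[k]))
                       for k in later):
                return False
    return True
```

With a task that is always enabled and bounds lower = 1 and upper = 2, action times 1.5 and 3.0 satisfy both conditions, an action at time 0.5 violates the lower bound, and a first action at time 2.5 violates the upper bound.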

The set of timed executions of B is denoted by texecs(B). A state is said to be 
reachable in B if it is the final state of some finite timed execution of B. 

A timed execution is admissible provided that the following condition is satisfied: 
Admissibility condition: If timed execution $\alpha$ is an infinite sequence, then the
times of the actions approach $\infty$. If $\alpha$ is a finite sequence, then in the final state of
$\alpha$, if task C is enabled, then $upper(C) = \infty$.

The admissibility condition says that time advances normally and that processing 
does not stop if the automaton is scheduled to perform some more work. The set of 
admissible timed executions of B is denoted by atexecs(B). 

Notice that the time bounds of an MMT automaton replace the fairness conditions of a
BIOA.

The timed trace of a timed execution $\alpha$ of B, denoted by $ttrace(\alpha)$, is the
subsequence of $\alpha$ consisting of all the external actions, each paired with its associated
time. The admissible timed traces of B, which are denoted by attraces(B), are the 
timed traces of admissible timed executions of B. 

2.4 The GT automaton model 

The GTA model uses time-passage actions called $\nu(t)$, $t \in \mathbb{R}^{+}$, to model the passage
of time. The time-passage action $\nu(t)$ represents the passage of time by the amount
t.

A timed signature S is a quadruple consisting of four disjoint sets of actions: the 
input actions in(S), the output actions out(S), the internal actions int(S), and the 
time-passage actions. For a GTA:

• the visible actions, vis(S), are the input and output actions, $in(S) \cup out(S)$

• the external actions, ext(S), are the visible and time-passage actions,
$vis(S) \cup \{\nu(t) : t \in \mathbb{R}^{+}\}$

• the discrete actions, disc(S), are the visible and internal actions, $vis(S) \cup int(S)$

• the locally controlled actions, local(S), are the output and internal actions,
$out(S) \cup int(S)$

• acts(S) are all the actions of S
A GTA A consists of the following four components:

• sig(A), a timed signature

• states(A), a set of states

• start(A), a nonempty subset of states(A) known as the start states or initial
states

• trans(A), a state transition relation, where
$trans(A) \subseteq states(A) \times acts(sig(A)) \times states(A)$; this must have the property that for every state s
and every input action $\pi$, there is a transition $(s, \pi, s') \in trans(A)$

Often acts(A) is used as shorthand for acts(sig(A)), and similarly in(A), and so on.

An element (s, π, s') of trans(A) is called a transition, or step, of A. If for a
particular state s and action π, A has some transition of the form (s, π, s'), then we
say that π is enabled in s. Since every input action is required to be enabled in every
state, automata are said to be input-enabled. The input-enabling assumption means 
that the automaton is not able to somehow "block" input actions from occurring. 

There are two simple axioms that A is required to satisfy: 

A1: If (s, ν(t), s') and (s', ν(t'), s'') are in trans(A), then (s, ν(t + t'), s'') is in trans(A).

A2: If (s, ν(t), s') ∈ trans(A) and 0 < t' < t, then there is a state s'' such that
(s, ν(t'), s'') and (s'', ν(t − t'), s') are in trans(A).

Axiom A1 allows repeated time-passage steps to be combined into one step, while
Axiom A2 is a kind of converse to A1 that allows a time-passage step to be split in
two.

A timed execution fragment of a GTA, A, is defined to be either a finite sequence
α = s_0, π_1, s_1, π_2, ..., π_r, s_r or an infinite sequence α = s_0, π_1, s_1, π_2, ..., π_r, s_r, ...,
where the s's are states of A, the π's are actions (either input, output, internal, or
time-passage) of A, and (s_k, π_{k+1}, s_{k+1}) is a step (or transition) of A for every k. Note
that if the sequence is finite, it must end with a state. The length of a finite execution
fragment α = s_0, π_1, s_1, π_2, ..., π_r, s_r is r. A timed execution fragment beginning with
a start state is called a timed execution.

Axioms A1 and A2 say that there is not much difference between timed execution
fragments that differ only by splitting and combining time-passage steps. Two timed
execution fragments α and α' are time-passage equivalent if α can be transformed
into α' by splitting and combining time-passage actions according to Axioms A1 and
A2.
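As an illustration only, the following Python sketch compares the action sequences of two fragments up to splitting and combining of time-passage actions. The encoding is an assumption made for this example: a time-passage action ν(t) is written as the pair ("nu", t), a discrete action as a string, and states are omitted. Equal merged sequences are therefore only a necessary condition for time-passage equivalence, since the definition also constrains the states.

```python
from fractions import Fraction

# Hypothetical encoding: ("nu", t) is the time-passage action nu(t);
# any other item is the name of a discrete action. States are omitted.

def is_time_passage(act):
    return isinstance(act, tuple) and act[0] == "nu"

def merge_time_passage(actions):
    """Combine every run of consecutive time-passage actions into one (Axiom A1)."""
    merged = []
    for act in actions:
        if is_time_passage(act) and merged and is_time_passage(merged[-1]):
            merged[-1] = ("nu", merged[-1][1] + act[1])
        else:
            merged.append(act)
    return merged

def same_up_to_time_passage(actions_a, actions_b):
    """True if both action sequences have the same fully merged form."""
    return merge_time_passage(actions_a) == merge_time_passage(actions_b)
```

Exact rationals (Fraction) are used so that sums of time amounts compare without rounding error.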

If α is any timed execution fragment and π_r is any action in α, then we say that
the time of occurrence of π_r is the sum of all the reals in the time-passage actions
preceding π_r in α. A timed execution fragment α is said to be admissible if the sum
of all the reals in the time-passage actions in α is ∞. The set of admissible timed
executions of A is denoted by atexecs(A). A state is said to be reachable in A if it is
the final state of a finite timed execution of A.

The timed trace of a timed execution fragment α, denoted by ttrace(α), is the
sequence of visible events in α, each paired with its time of occurrence. The admissible
timed traces of A, denoted by attraces(A), are the timed traces of admissible timed
executions of A.
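The definitions of time of occurrence and timed trace can be sketched in a few lines of Python. The encoding below is hypothetical: ("nu", t) stands for the time-passage action ν(t), any other item names a discrete action, and the set of visible actions is passed explicitly.

```python
from fractions import Fraction

def timed_trace(actions, visible):
    """Pair each visible action with its time of occurrence.

    `actions` lists the actions of a finite fragment in order;
    `visible` is the set of input and output action names.
    """
    now = Fraction(0)  # sum of the time-passage amounts seen so far
    trace = []
    for act in actions:
        if isinstance(act, tuple) and act[0] == "nu":
            now += act[1]
        elif act in visible:
            trace.append((act, now))
        # internal actions advance neither time nor the trace
    return trace
```

For instance, with visible actions send and recv, the fragment ν(2), send, tick, ν(1/2), recv yields the pairs (send, 2) and (recv, 5/2); the internal action tick is dropped.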

We may refer to a timed execution simply as an execution. Similarly a timed trace 
can be referred to as a trace. 

2.5 The Clock GT automaton model 

A Clock GTA is a GTA with a special component included in the state; this special
variable is called Clock and it can assume values in ℝ. The purpose of Clock is to
model the local clock of the process. The only actions that are allowed to modify
Clock are the time-passage actions ν(t). When a time-passage action ν(t) is executed
by an automaton, the Clock is incremented by an amount of time t' > 0, independent
of the amount t of time specified by the time-passage action.² Since the occurrence
of the time-passage action ν(t) represents the passage of (real) time by the amount
t, by incrementing the local variable Clock by an amount t' different from t we are
able to model the passage of (local) time by the amount t'. As a special case, we
have some time-passage actions in which t' = t; in these cases the local clock of the
process is running at the speed of real time.

² Formally, we have that if (s, ν(t), s') is a step of an execution then also (s, ν(t'), s'), for any t' > 0,
is a step of that execution. Hence a Clock GTA cannot keep track of the real time.

In the following and in the rest of the thesis, we use the notation s.x to denote the 
value of state component x in state s. 

Definition 2.5.1 A step (s_{k-1}, ν(t), s_k) of a Clock GTA is called regular if
s_k.Clock − s_{k-1}.Clock = t; it is called irregular if it is not regular.

That is, a time-passage step executing action ν(t) is regular if it increases Clock by
t' = t. In a regular time-passage step, the local clock is increased by the same amount
as the real time, whereas in an irregular time-passage step ν(t) that represents the
passage of real time by the amount t, the local clock is increased either by t' < t (the
local clock is slower than the real time) or by t' > t (the local clock is faster than the
real time).

Definition 2.5.2 A timed execution fragment α of a Clock GTA is called regular if
all the time-passage steps of α are regular. It is called irregular if it is not regular,
i.e., if at least one of its time-passage steps is irregular.
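Definitions 2.5.1 and 2.5.2 translate directly into a check on recorded steps. In the sketch below, which is only illustrative, each time-passage step is assumed to be given as a triple (Clock before, t, Clock after); this representation is not part of the model.

```python
from fractions import Fraction

def is_regular_step(clock_before, t, clock_after):
    """A time-passage step nu(t) is regular if Clock grows by exactly t."""
    return clock_after - clock_before == t

def is_regular_fragment(time_passage_steps):
    """A fragment is regular if every one of its time-passage steps is regular."""
    return all(is_regular_step(c0, t, c1) for (c0, t, c1) in time_passage_steps)
```

A fragment with no time-passage steps is vacuously regular under this check, which agrees with the definition.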

In a partially synchronous distributed system processes are expected to respond 
and messages are expected to be delivered within given time bounds. A timing failure 
is a violation of these time bounds. An irregular time-passage step can model the 
occurrence of a timing failure. Thus in a regular execution fragment there are no 
timing failures. 

Transforming MMTA into Clock GTA. The MMT automata are a special case
of GT automata. There is a standard transformation technique that, given an MMTA,
produces an equivalent GTA, i.e., one that has the same external behavior (see Section
23.2.2 of [35]).

In this section, we show how to transform any MMT automaton (A, b) into an
equivalent clock general timed automaton A' = clockgen(A, b). Automaton A' acts
like automaton A, but the time bounds of the boundmap b are expressed as restrictions
on the value that the local time can assume. The technique used is essentially the same
as the one that transforms an MMTA into an equivalent GTA with some modifications
to handle the Clock variable, that is, the local time.

The transformation involves building local time deadlines into the state and not 
allowing the local time to pass beyond those deadlines while they are still in force. 
The deadlines are set according to the boundmap b. New constraints on non-time- 
passage actions are added to express the lower bound conditions. Notice however, 
that all these constraints are on the local time, while in the transformation of an 
MMTA into a GTA they are on the real time. 

More specifically, the state of the underlying BIOA A is augmented with a Clock
component, plus First(C) and Last(C) components for each task C. The First(C)
and Last(C) components represent, respectively, the earliest and latest local times at
which the next action in task C is allowed to occur. The time-passage actions ν(t)
are also added.

The First and Last components get updated by the various steps, according to 
the lower and upper bounds specified by the boundmap b. The time-passage actions 
ν(t) have an explicit precondition saying that the local time cannot pass beyond any
of the Last(C) values; this is because these represent deadlines for the various tasks. 
Restrictions are also added on actions in any task C, saying that the current local 
time Clock must be at least as great as the lower bound First(C). 

In more detail, the timed signature of A' = clockgen(A, b) is the same as the
signature of A, with the addition of the time-passage actions ν(t), t ∈ ℝ⁺. Each state
of A' consists of the following components:

basic ∈ states(A), initially a start state of A

Clock ∈ ℝ, initially arbitrary

For each task C of A:

First(C) ∈ ℝ, initially Clock + lower(C) if C is enabled in state basic,
otherwise 0

Last(C) ∈ ℝ ∪ {∞}, initially Clock + upper(C) if C is enabled in basic,
otherwise ∞

The transitions are defined as follows. If π ∈ acts(A), then (s, π, s') ∈ trans(A')
exactly if all the following conditions hold:

1. (s.basic, π, s'.basic) ∈ trans(A).

2. s'.Clock = s.Clock.

3. For each C ∈ tasks(A),

(a) If π ∈ C, then s.First(C) ≤ s.Clock.

(b) If C is enabled in both s.basic and s'.basic and π ∉ C, then s.First(C) =
s'.First(C) and s.Last(C) = s'.Last(C).

(c) If C is enabled in s'.basic and either C is not enabled in s.basic or π ∈ C,
then s'.First(C) = s.Clock + lower(C) and s'.Last(C) = s.Clock + upper(C).

(d) If C is not enabled in s'.basic, then s'.First(C) = 0 and s'.Last(C) = ∞.

If π = ν(t), then (s, π, s') ∈ trans(A') exactly if all the following conditions hold:

1. s'.basic = s.basic.

2. s'.Clock > s.Clock.

3. For each C ∈ tasks(A),

(a) s'.Clock ≤ s.Last(C).

(b) s'.First(C) = s.First(C) and s'.Last(C) = s.Last(C).
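The bookkeeping performed by these transition rules can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the construction itself: the basic state is abstracted away, the sets of tasks enabled before and after a step are supplied by the caller, the class and function names are hypothetical, and the deadline comparisons are taken to be non-strict.

```python
import math
from dataclasses import dataclass

@dataclass
class ClockGenState:
    clock: float
    first: dict  # task -> earliest local time for its next action
    last: dict   # task -> latest local time (math.inf when no deadline)

def initial_state(clock, bounds, enabled):
    """bounds maps task -> (lower, upper); enabled is the set of enabled tasks."""
    first = {c: clock + lo if c in enabled else 0 for c, (lo, hi) in bounds.items()}
    last = {c: clock + hi if c in enabled else math.inf for c, (lo, hi) in bounds.items()}
    return ClockGenState(clock, first, last)

def can_advance(s, new_clock):
    """Time-passage precondition: Clock grows but passes no Last(C) deadline."""
    return new_clock > s.clock and all(new_clock <= d for d in s.last.values())

def advance(s, new_clock):
    if not can_advance(s, new_clock):
        raise ValueError("time-passage step not enabled")
    return ClockGenState(new_clock, dict(s.first), dict(s.last))

def discrete_step(s, task, bounds, enabled_before, enabled_after):
    """Update First/Last for an action of `task` (None for an input action)."""
    if task is not None and s.first[task] > s.clock:
        raise ValueError("lower bound not yet reached")      # rule (a)
    first, last = dict(s.first), dict(s.last)
    for c, (lo, hi) in bounds.items():
        if c not in enabled_after:                           # rule (d)
            first[c], last[c] = 0, math.inf
        elif c not in enabled_before or c == task:           # rule (c)
            first[c], last[c] = s.clock + lo, s.clock + hi
        # otherwise rule (b): both bounds stay unchanged
    return ClockGenState(s.clock, first, last)
```

With a single task C bounded by (0, 5) and Clock initially 10, the clock may advance to 15 but not to 16 until C takes a step or becomes disabled.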

The following lemma holds. 

Lemma 2.5.3 In any reachable state of clockgen(A, b) and for any task C of A, we
have that:

1. Clock ≤ Last(C).

2. If C is enabled, then Last(C) ≤ Clock + upper(C).

3. First(C) ≤ Clock + lower(C).

4. First(C) ≤ Last(C).

If some of the timing requirements specified by b are trivial (that is, if some
lower bounds are 0 or some upper bounds are ∞), then it is possible to simplify
the automaton clockgen(A, b) just by omitting mention of these components. In this
thesis all the MMT automata have boundmaps that specify a lower bound of 0 and
an upper bound of a fixed constant ℓ; thus the above general transformation could
be simplified (by omitting mention of First(C) and using ℓ instead of upper(C), for
any C) for our purposes. In the following lemma we consider lower(C) = 0 and
upper(C) = ℓ.

Lemma 2.5.4 Consider a regular execution fragment α of clockgen(A, b), starting
from a reachable state s_0 and lasting for more than ℓ time. Assume that lower(C) = 0
and upper(C) = ℓ for each task C of automaton A. Then (i) any task C enabled in
s_0 either has a step or is disabled within ℓ time, and (ii) any new enabling of C has a
subsequent step or disabling within ℓ time, provided that α lasts for more than ℓ time
from the enabling of C.

Proof: Let us first prove (i). Let C be a task enabled in state s_0. By Lemma 2.5.3
we have that s_0.First(C) ≤ s_0.Clock ≤ s_0.Last(C) and that s_0.Last(C) ≤ s_0.Clock + ℓ.
Since the execution is regular, within time ℓ, Clock passes the value s_0.Clock + ℓ.
But this cannot happen (since s_0.Last(C) ≤ s_0.Clock + ℓ) unless Last(C) is increased,
which means either C has a step or it is disabled within ℓ time. The proof of (ii) is
similar. Let s be the state in which C becomes enabled. Then the proof is as before
substituting s_0 with s. ∎

2.6 Composition of automata

The composition operation allows an automaton representing a complex system to be 
constructed by composing automata representing simpler system components. The 
most important characteristic of the composition of automata is that properties of 
isolated system components still hold when those isolated components are composed 
with other components. The composition identifies actions with the same name in 
different component automata. When any component automaton performs a step
involving π, so do all component automata that have π in their signatures. Since
internal actions of an automaton A are intended to be unobservable by any other
automaton B, automaton A cannot be composed with automaton B unless the internal
actions of A are disjoint from the actions of B. (Otherwise, A's performance of an 
internal action could force B to take a step.) Moreover, A and B cannot be composed 
unless the sets of output actions of A and B are disjoint. (Otherwise two automata 
would have the control of an output action.) 

Composition of BIOA. 

Let I be an arbitrary finite index set.³ A finite collection {S_i}_{i∈I} of
signatures is said to be compatible if for all i, j ∈ I, i ≠ j, the following hold:⁴

1. int(S_i) ∩ acts(S_j) = ∅

2. out(S_i) ∩ out(S_j) = ∅

A finite collection of automata is said to be compatible if their signatures are
compatible.

³ The composition operation for BIOA is defined also for an infinite but countable collection of
automata [35], but we only consider the composition of a finite number of automata.

⁴ We remark that for the composition of an infinite countable collection of automata, there is
a third condition on the definition of compatible signature [35]. However this third condition is
automatically satisfied when considering only finite sets of automata.

When we compose a collection of automata, output actions of the components
become output actions of the composition, internal actions of the components become
internal actions of the composition, and actions that are inputs to some components
but outputs of none become input actions of the composition. Formally, the
composition S = ∏_{i∈I} S_i of a finite compatible collection of signatures {S_i}_{i∈I} is defined to
be the signature with

• out(S) = ∪_{i∈I} out(S_i)

• int(S) = ∪_{i∈I} int(S_i)

• in(S) = ∪_{i∈I} in(S_i) − ∪_{i∈I} out(S_i)
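The compatibility conditions and the composition of signatures can be checked mechanically. The following Python sketch is illustrative; the Signature class and the action names used with it are assumptions of this example.

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass(frozen=True)
class Signature:
    inputs: frozenset
    outputs: frozenset
    internals: frozenset

    @property
    def acts(self):
        return self.inputs | self.outputs | self.internals

def compatible(sigs):
    """Check both compatibility conditions for every ordered pair i != j."""
    return all(
        not (a.internals & b.acts) and not (a.outputs & b.outputs)
        for a, b in permutations(sigs, 2)
    )

def compose(sigs):
    """Compose a finite compatible collection of signatures."""
    if not compatible(sigs):
        raise ValueError("signatures are not compatible")
    outs = frozenset().union(*(s.outputs for s in sigs))
    ints = frozenset().union(*(s.internals for s in sigs))
    ins = frozenset().union(*(s.inputs for s in sigs)) - outs
    return Signature(ins, outs, ints)
```

Composing a process-like signature (input recv, output send, internal think) with a channel-like one (input send, output recv) leaves no input actions, since every input of one component is an output of the other.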

The composition A = ∏_{i∈I} A_i of a finite collection of automata is defined as
follows:⁵

• sig(A) = ∏_{i∈I} sig(A_i)

• states(A) = ∏_{i∈I} states(A_i)

• start(A) = ∏_{i∈I} start(A_i)

• trans(A) is the set of triples (s, π, s') such that, for all i ∈ I, if π ∈ acts(A_i),
then (s_i, π, s'_i) ∈ trans(A_i); otherwise s_i = s'_i

• tasks(A) = ∪_{i∈I} tasks(A_i)

Thus, the states and start states of the composition automaton are vectors of states
and start states, respectively, of the component automata. The transitions of the
composition are obtained by allowing all the component automata that have a
particular action π in their signature to participate simultaneously in steps involving
π, while all the other component automata do nothing. The task partition of the
composition's locally controlled actions is formed by taking the union of the
components' task partitions; that is, each equivalence class of each component automaton
becomes an equivalence class of the composition. This means that the task structure
of individual components is preserved when the components are composed. Notice
that since the automata A_i are input-enabled, so is their composition. The following
theorem follows from the definition of composition.

⁵ The ∏ notation in the definition of start(A) and states(A) refers to the ordinary Cartesian
product, while the ∏ notation in the definition of sig(A) refers to the composition operation just
defined, for signatures. Also, the notation s_i denotes the ith component of the state vector s.

Theorem 2.6.1 The composition of a compatible collection of BIO automata is a 
BIO automaton. 

The following theorems relate the executions and traces of a composition to those
of the component automata. The first says that an execution or trace of a
composition "projects" to yield executions or traces of the component automata. Given
an execution, α = s_0, π_1, s_1, ..., of A, let α|A_i be the sequence obtained by deleting
each pair π_r, s_r for which π_r is not an action of A_i and replacing each remaining s_r
by (s_r)_i, that is, automaton A_i's piece of the state s_r. Also, given a trace β of A (or,
more generally, any sequence of actions), let β|A_i be the subsequence of β consisting
of all the actions of A_i in β. Also, | represents the subsequence of a sequence β of
actions consisting of all the actions in a given set in β.

Theorem 2.6.2 Let {A_i}_{i∈I} be a compatible collection of automata and let A =
∏_{i∈I} A_i.

1. If α ∈ execs(A), then α|A_i ∈ execs(A_i) for every i ∈ I.

2. If β ∈ traces(A), then β|A_i ∈ traces(A_i) for every i ∈ I.

The other two are converses of Theorem 2.6.2. The next theorem says that, under 
certain conditions, executions of component automata can be "pasted together" to 
form an execution of the composition. 

Theorem 2.6.3 Let {A_i}_{i∈I} be a compatible collection of automata and let A =
∏_{i∈I} A_i. Suppose α_i is an execution of A_i for every i ∈ I, and suppose β is a sequence
of actions in ext(A) such that β|A_i = trace(α_i) for every i ∈ I. Then there is an
execution α of A such that β = trace(α) and α_i = α|A_i for every i ∈ I.

The final theorem says that traces of component automata can also be pasted
together to form a trace of the composition.

Theorem 2.6.4 Let {A_i}_{i∈I} be a compatible collection of automata and let A =
∏_{i∈I} A_i. Suppose β is a sequence of actions in ext(A). If β|A_i ∈ traces(A_i) for every
i ∈ I, then β ∈ traces(A).

Theorem 2.6.4 implies that in order to show that a sequence is a trace of a system, 
it is enough to show that its projection on each individual system component is a trace 
of that component. 
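The use of Theorem 2.6.4 as a proof technique can be sketched as follows. The sketch assumes that each component is described by its set of actions together with a predicate deciding whether a sequence is one of its traces; these predicates are hypothetical stand-ins supplied by the caller, and β is assumed to contain only external actions of the composition.

```python
def project(beta, component_actions):
    """Return beta|A_i: the subsequence of beta made of actions of A_i."""
    return [act for act in beta if act in component_actions]

def is_trace_of_composition(beta, components):
    """Apply the criterion of Theorem 2.6.4.

    `components` is a list of (action_set, is_trace) pairs, one per A_i,
    where is_trace decides membership in traces(A_i).
    """
    return all(is_trace(project(beta, acts)) for acts, is_trace in components)
```

The check reduces a question about the whole system to one question per component, which is the point of the theorem.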

Composition of MMTA. 

MMT automata can be composed in much the same way as BIOA, by identifying 
actions having the same name in different automata. 

Let I be an arbitrary finite index set. A finite collection of MMT automata is
said to be compatible if their underlying BIO automata are compatible. Then the
composition (A, b) = ∏_{i∈I} (A_i, b_i) of a finite compatible collection of MMT automata
{(A_i, b_i)}_{i∈I} is the MMT automaton defined as follows:

• A = ∏_{i∈I} A_i, that is, A is the composition of the underlying BIO automata A_i
for all the components.

• For each task C of A, b's lower and upper bounds for C are the same as those
of b_i, where A_i is the unique component I/O automaton having task C.

Clearly we have the following theorem. 

Theorem 2.6.5 The composition of a compatible collection of MMT automata is an 
MMT automaton. 

The following theorems correspond to Theorems 2.6.2-2.6.4 stated for BIOA. 

Theorem 2.6.6 Let {B_i}_{i∈I} be a compatible collection of MMT automata and let
B = ∏_{i∈I} B_i.

1. If α ∈ atexecs(B), then α|B_i ∈ atexecs(B_i) for every i ∈ I.

2. If β ∈ attraces(B), then β|B_i ∈ attraces(B_i) for every i ∈ I.

Theorem 2.6.7 Let {B_i}_{i∈I} be a compatible collection of MMT automata and let
B = ∏_{i∈I} B_i. Suppose α_i is an admissible timed execution of B_i for every i ∈ I
and suppose β is a sequence of (action, time) pairs, where all the actions in β are in
ext(A), such that β|B_i = ttrace(α_i) for every i ∈ I. Then there is an admissible timed
execution α of B such that β = ttrace(α) and α_i = α|B_i for every i ∈ I.

Theorem 2.6.8 Let {B_i}_{i∈I} be a compatible collection of MMT automata and let
B = ∏_{i∈I} B_i. Suppose β is a sequence of (action, time) pairs, where all the actions in
β are in ext(A). If β|B_i ∈ attraces(B_i) for every i ∈ I, then β ∈ attraces(B).

Composition of GTA. 

Let I be an arbitrary finite index set. A finite collection {S_i}_{i∈I} of timed signatures
is said to be compatible if for all i, j ∈ I, i ≠ j, we have

1. int(S_i) ∩ acts(S_j) = ∅

2. out(S_i) ∩ out(S_j) = ∅

A collection of GTAs is compatible if their timed signatures are compatible.

The composition S = ∏_{i∈I} S_i of a finite compatible collection of timed signatures
{S_i}_{i∈I} is defined to be the timed signature with

• out(S) = ∪_{i∈I} out(S_i)

• int(S) = ∪_{i∈I} int(S_i)

• in(S) = ∪_{i∈I} in(S_i) − ∪_{i∈I} out(S_i)

The composition A = ∏_{i∈I} A_i of a finite compatible collection of GTAs {A_i}_{i∈I} is
defined as follows:

• sig(A) = ∏_{i∈I} sig(A_i)

• states(A) = ∏_{i∈I} states(A_i)

• start(A) = ∏_{i∈I} start(A_i)

• trans(A) is the set of triples (s, π, s') such that, for all i ∈ I, if π ∈ acts(A_i),
then (s_i, π, s'_i) ∈ trans(A_i); otherwise s_i = s'_i

The transitions of the composition are obtained by allowing all the components that
have a particular action π in their signature to participate, simultaneously, in steps
involving π, while all the other components do nothing. Note that this implies that
all the components participate in time-passage steps, with the same amount of time 
passing for all of them. 

Theorem 2.6.9 The composition of a compatible collection of general timed au- 
tomata is a general timed automaton. 

The following theorems correspond to Theorems 2.6.2-2.6.4 stated for BIOA and
to Theorems 2.6.6-2.6.8 stated for MMTA. Theorem 2.6.11 has a small technicality
that is a consequence of the fact that the GTA model allows consecutive time-passage
steps to appear in an execution. Namely, the admissible timed execution α that is
produced by "pasting together" individual admissible timed executions α_i might not
project to give exactly the original α_i's, but rather admissible timed executions that
are time-passage equivalent to the original α_i's.

Theorem 2.6.10 Let {B_i}_{i∈I} be a compatible collection of general timed automata
and let B = ∏_{i∈I} B_i.

1. If α ∈ atexecs(B), then α|B_i ∈ atexecs(B_i) for every i ∈ I.

2. If β ∈ attraces(B), then β|B_i ∈ attraces(B_i) for every i ∈ I.

Theorem 2.6.11 Let {B_i}_{i∈I} be a compatible collection of general timed automata
and let B = ∏_{i∈I} B_i. Suppose α_i is an admissible timed execution of B_i for every
i ∈ I, and suppose β is a sequence of (action, time) pairs, with all the actions in
vis(B), such that β|B_i = ttrace(α_i) for every i ∈ I. Then there is an admissible
timed execution α of B such that β = ttrace(α) and α_i is time-passage equivalent to
α|B_i for every i ∈ I.

Theorem 2.6.12 Let {B_i}_{i∈I} be a compatible collection of general timed automata
and let B = ∏_{i∈I} B_i. Suppose β is a sequence of (action, time) pairs, where all the
actions in β are in vis(B). If β|B_i ∈ attraces(B_i) for every i ∈ I, then β ∈ attraces(B).

Composition of Clock GTA. 

Clock GT automata are GT automata; thus, they can be composed as GT automata
are composed. However we point out that the composition of Clock GT automata does
not yield a Clock GTA but a GTA. This follows from the fact that in a composition
of Clock GT automata there is more than one special state component Clock. It
is possible to generalize the definition of Clock GTA by letting a Clock GTA have
several special state components Clock_1, Clock_2, ... so that the composition of Clock
GT automata is still a Clock GTA. However we do not make this extension in this
thesis, since for our purposes we do not need the composition of Clock GT automata
to be a Clock GTA.

2.7 Bibliographic notes 

The basic I/O automaton model was introduced by Lynch and Tuttle in [37]. The MMT
automaton model was designed by Merritt, Modugno, and Tuttle [42]. More work
on the MMT automaton model has been done by Lynch and Attiya [36]. The GT
automaton model was introduced by Lynch and Vaandrager [38, 39, 40]. The book
by Lynch [35] contains a broad coverage of these models and more pointers to the
relevant literature.

Chapter 3



The distributed setting 



In this chapter we discuss the distributed setting. We consider a complete network 
of n processes communicating by exchange of messages in a partially synchronous 
setting. Each process of the system is uniquely identified by its identifier i ∈ X,
where X is a totally ordered finite set of n identifiers. The set X is known by all 
the processes. Moreover each process of the system has a local clock. Local clocks 
can run at different speeds, though in general we expect them to run at the same 
speed as real time. We assume that a local clock is available also for channels; though 
this may seem somewhat strange, it is just a formal way to express the fact that a 
channel is able to deliver a given message within a fixed amount of time, by relying 
on some timing mechanism (which we model with the local clock). We use Clock GT 
automata to model both processes and channels. 

Throughout the thesis we use two constants, ℓ and d, to represent upper bounds on
the time needed to execute an enabled action and to deliver a message, respectively.
These bounds do not necessarily hold for every action and message in every execution;
a violation of these bounds is a timing failure. A Clock GTA models timing failures
with irregular time-passage actions.

3.1 Processes

A process is modeled by a Clock GT automaton. We allow process stopping failures
and recoveries and timing failures. To formally model process stops and recoveries
we model process i with a Clock GTA which has a special state component called
Status_i and two input actions Stop_i and Recover_i. The state variable Status_i reflects
the current condition of process i. The effect of action Stop_i is to set Status_i to
stopped, while the effect of Recover_i is to set Status_i to alive. Moreover when
Status_i = stopped, all the locally controlled actions are not enabled and the input
actions have no effect, except for action Recover_i.

Definition 3.1.1 We say that a process i is alive (resp. stopped) in a given state if
in that state we have Status_i = alive (resp. Status_i = stopped).

Definition 3.1.2 We say that a process i is alive (resp. stopped) in a given execution 
fragment, if it is alive (resp. stopped) in all the states of the execution fragment. 

Between a failure and a recovery a process does not lose its state. We remark that 
PAXOS needs only a small amount of stable storage (see Section 6.5); however, for 
simplicity, we assume that the entire state of a process is stable. 

Definition 3.1.3 A "process automaton" for process i is a Clock GTA having the
special Status_i variable and input actions Stop_i and Recover_i and whose behavior
satisfies the following. The effect of action Stop_i is to set Status_i to stopped, while
the effect of Recover_i is to set Status_i to alive. In any reachable state s such that
s.Status_i = stopped the only possible steps are (s, π, s') where π is an input action.
Moreover when s.Status_i = stopped for all π ≠ Recover_i state s' is equal to state s.
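The stop and recovery behavior required of a process automaton can be illustrated with a small Python sketch. Everything beyond the Status variable and the Stop and Recover actions is an assumption of this example: the inbox field stands in for the rest of the process state, the generic input action appends its payload to it, and the initial status is taken to be alive.

```python
from dataclasses import dataclass

@dataclass
class ProcessState:
    status: str = "alive"  # assumed initial value; either "alive" or "stopped"
    inbox: tuple = ()      # hypothetical stand-in for the rest of the state

def on_input(state, action, payload=None):
    """Apply an input action and return the resulting state."""
    if action == "Stop":
        return ProcessState("stopped", state.inbox)
    if action == "Recover":
        return ProcessState("alive", state.inbox)
    if state.status == "stopped":
        return state  # inputs other than Recover have no effect while stopped
    return ProcessState(state.status, state.inbox + (payload,))

def locally_controlled_enabled(state):
    """Locally controlled actions are enabled only while the process is alive."""
    return state.status == "alive"
```

Note that the state survives a stop: after Recover the inbox is exactly what it was when the process stopped, matching the assumption that the entire state is stable.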

We also assume that there is an upper bound of ℓ on the elapsed (local) clock
time if some locally controlled action is enabled. That is, if a locally controlled action
becomes enabled, then it is executed within (local) time ℓ of the enabling (local) time,
unless it becomes disabled again. This time bound is directly encoded into the steps
of process automata. We remark that, when the execution is regular, the local clock
runs at the speed of real time and thus the time bound holds with respect to the real
time, too.

Finally, we provide the following definition of "stable" execution fragment of a 
given process. This definition will be used later to define a stable execution of a 
distributed system. 

Definition 3.1.4 Given a process automaton PROCESS_i, we say that an execution
fragment α of PROCESS_i is "stable" if process i is either stopped or alive in α and α
is regular.

3.2 Channels 

We consider unreliable channels that can lose and duplicate messages. Reordering
of messages is not considered a failure. Timing failures are also possible. Figure 3-1
shows the code of a Clock GT automaton CHANNEL_ij, which models the
communication channel from process i to process j; there is one automaton for each possible
choice of i and j. Notice that we allow the possibility that the sender and the receiver
are the same process. We denote by M the set of messages that can be sent over the
channels. The interface of CHANNEL_ij, besides the actions modelling failures, consists
of input actions Send(m)_ij, m ∈ M, which are used by process i to send messages to
process j, and output actions Receive(m)_ij, m ∈ M, which are used by the channel
automaton to deliver messages sent by process i to process j.

Channel failures are formally modeled as input actions Lose_ij and Duplicate_ij. The
effect of these two actions is to manipulate Msgs. In particular Lose_ij deletes one
message from Msgs; Duplicate_ij duplicates one of the messages in Msgs. When the
execution is regular, automaton CHANNEL_ij guarantees that messages are delivered
within time d of the sending. When the execution is irregular, messages can take
an arbitrarily long time to be delivered.

The next lemma provides a basic property of CHANNEL_ij.

Lemma 3.2.1 In a reachable state s of CHANNEL_ij, if a message (m, t) ∈ s.Msgs_ij
then t ≤ s.Clock_ij ≤ t + d.

CHANNEL_ij

Signature:

    Input: Send(m)_ij, Lose_ij, Duplicate_ij
    Output: Receive(m)_ij
    Time-passage: ν(t)

State:

    Clock ∈ ℝ, initially arbitrary
    Msgs, a set of elements of M × ℝ, initially empty

Actions:

    input Send(m)_ij
        Eff: add (m, Clock) to Msgs

    output Receive(m)_ij
        Pre: (m, t) is in Msgs, for some t
        Eff: remove (m, t) from Msgs

    input Lose_ij
        Eff: remove one element of Msgs

    input Duplicate_ij
        Eff: let (m, t) be an element of Msgs
             let t' s.t. t ≤ t' ≤ Clock
             place (m, t') into Msgs

    time-passage ν(t)
        Pre: Let t' > 0 be s.t. for all (m, t'') ∈ Msgs
             Clock + t' ≤ t'' + d
        Eff: Clock := Clock + t'

Figure 3-1: Automaton CHANNEL_ij

Proof: We prove the lemma by induction on the length k of an execution α =
s_0 π_1 s_1 ... s_{k-1} π_k s_k. The base k = 0 is trivial since s_0.Msgs is empty. For the inductive
step assume that the assertion is true in state s_k and consider the execution απs. We
need to prove that the assertion is still true in s. Actions Lose_ij, Duplicate_ij, and
Receive(m)_ij do not add any new element to Msgs and do not modify Clock; hence
they cannot make the assertion false. Thus we only need to consider the cases π =
Send(m)_ij and π = ν(t). If π = Send(m)_ij a new element (m, t), with t = s_k.Clock,
is added to Msgs; however since s_k.Clock = s.Clock the assertion is still true in state
s. If π = ν(t), by the precondition of ν(t), we have that s.Clock ≤ t + d for all (m, t)
in Msgs. Thus the assertion is true also in state s. ∎
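The channel automaton and the invariant of Lemma 3.2.1 can be exercised with a small Python sketch. It is an illustration under stated assumptions rather than a full rendering of the automaton: the Lose and Duplicate actions are omitted, the clock increment t' is passed directly to advance, Receive delivers the oldest matching message (one of the choices the automaton allows), and the method names are hypothetical.

```python
from fractions import Fraction

class Channel:
    def __init__(self, d, clock=Fraction(0)):
        self.d = d          # bound on the delivery time
        self.clock = clock  # the local Clock component
        self.msgs = set()   # Msgs: pairs (m, t), t being the Clock value at sending

    def send(self, m):
        self.msgs.add((m, self.clock))

    def receive(self, m):
        matches = [entry for entry in self.msgs if entry[0] == m]
        if not matches:
            raise ValueError("Receive(m) is not enabled")
        oldest = min(matches, key=lambda entry: entry[1])
        self.msgs.remove(oldest)
        return oldest

    def max_advance(self):
        """Largest t' with Clock + t' <= t'' + d for all (m, t''); None if unbounded."""
        if not self.msgs:
            return None
        return min(t for _, t in self.msgs) + self.d - self.clock

    def advance(self, t_prime):
        limit = self.max_advance()
        if t_prime <= 0 or (limit is not None and t_prime > limit):
            raise ValueError("time-passage step not enabled")
        self.clock += t_prime

    def invariant(self):
        """The assertion of Lemma 3.2.1."""
        return all(t <= self.clock <= t + self.d for _, t in self.msgs)
```

With d = 3, a message sent at clock 0 blocks the clock from passing 3 until it is received; once Msgs is empty the clock may advance freely.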

We remark that if CHANNEL_ij is not in a reachable state then it may be unable
to take time-passage steps, because Msgs_ij may contain messages (m, t) for which
Clock_ij > t + d and thus the time-passage actions are no longer enabled, that is, time
cannot pass.

The following definition of "stable" execution fragment for a channel captures the
condition under which messages are delivered on time.

Definition 3.2.2 Given a channel CHANNEL_ij, we say that an execution fragment
α of CHANNEL_ij is "stable" if no Lose_ij and Duplicate_ij actions occur in α and α is
regular.

The next lemma proves that in a stable execution fragment messages are delivered
within time d of the sending.

Lemma 3.2.3 In a stable execution fragment α of CHANNEL_ij beginning in a
reachable state s and lasting for more than d time, we have that (i) all messages (m, t)
that in state s are in Msgs_ij are delivered by time d, and (ii) any message sent in α
is delivered within time d of the sending, provided that α lasts for more than d time
from the sending of the message.

Proof: Let us hrst prove assertion (z). Let (m,t) be a message belonging to s.Msgs—. 
By Lemma 3.2. f we have that t < s.Clockij < t + d. However since a is stable, the 
time-passage actions increment Clockij at the speed of real time and since a lasts for 
more than d time, Clock passes the value t + d. However this cannot happen if m is 
not delivered since by the preconditions of vit) of CHANNEL^-, all the increments t' of 
Clockij are such that Clockij -\-t'<t-\-d. Notice that m cannot be lost (by a Lose 8J - 
action), since a is stable. 

Now let us prove assertion (ii). Let (V,Send(m) 8J , s") be the step that puts (ra,i), 
with t = s'. Clock, in Msgs. Since s'. Clock = s". Clock, we have that s".Clocki t j = t. 
Since a is stable, the time-passage actions increment Clockij at the speed of real time 
and since a lasts for more than d time from the sending of ra, Clockij passes the value 
t + d. However this cannot happen if m is not delivered since by the preconditions of 
vit) of CHANNEL; j , all the increments t' of Clockij are such that Clockij -\-t' < t + d. 
Again, notice that m cannot be lost (by a Lose 8J action), since a is stable. I 
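
The deadline argument used in this proof can be illustrated with a small executable sketch. It is a toy model and not the channel automaton itself: the class name Channel and its methods are hypothetical, Lose and Duplicate are left out (as in a stable fragment), and delivery is assumed to remove the message from Msgs. The only point it shows is that time cannot advance past t + d while a message stamped t is still pending.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Toy model of a channel whose messages must be delivered within d."""
    d: float                                   # delivery bound
    clock: float = 0.0                         # plays the role of Clock
    msgs: list = field(default_factory=list)   # pending pairs (m, t)

    def send(self, m):
        # A message is stamped with the clock value at sending time.
        self.msgs.append((m, self.clock))

    def receive(self, m):
        # Assumption: delivering m removes one pending copy of it.
        for k, (x, _) in enumerate(self.msgs):
            if x == m:
                del self.msgs[k]
                return
        raise ValueError("message not in channel")

    def max_advance(self):
        # Largest t' such that clock + t' <= t + d for every pending (m, t).
        if not self.msgs:
            return float("inf")
        return min(t + self.d for _, t in self.msgs) - self.clock

    def advance(self, dt):
        # Time-passage step: only enabled while no pending deadline is passed.
        if dt > self.max_advance():
            raise RuntimeError("time-passage step not enabled")
        self.clock += dt
```

In this sketch a pending message blocks the clock at t + d, so an execution that lasts longer than d after a send must contain the corresponding receive, which is the shape of the argument in the proof above.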
3.3 Distributed systems

In this section we give a formal definition of distributed system. A distributed system 
is the composition of automata modelling channels and processes. We are interested 
in modelling bad and good behaviors of a distributed system; in order to do so 
we provide some definitions that characterize the behavior of a distributed system. 
The definition of "nice" execution fragment given in the following captures the good 
behavior of a distributed system. Informally, a distributed system behaves nicely 
if there are no process failures and recoveries, no channel failures and no irregular 
steps — remember that an irregular step models a timing failure — and a majority of 
the processes are alive. 

Definition 3.3.1 Given a set J ⊆ ℐ of processes, a communication system for J is
the composition of channel automata CHANNEL_{i,j} for all possible choices of i, j ∈ J.

Definition 3.3.2 A distributed system is the composition of process automata modeling
some set J of processes and a communication system for J.

In this thesis we will always compose automata that model the set of all processes
ℐ. Thus we define the communication system S_CHA to be the communication system
for the set ℐ of all processes. Figure 3-2 shows this communication system and its
interactions with the external environment.

Next we provide the definition of "stable" execution fragment for a distributed 
system exploiting the definition of stable execution fragment given previously for 
channels and process automata. 

Definition 3.3.3 Given a distributed system S, we say that an execution fragment
α of S is "stable" if:

1. for all automata PROCESS_i modelling process i, i ∈ S, it holds that α|PROCESS_i
is a stable execution fragment for process i;

2. for all channels CHANNEL_{i,j} with i, j ∈ S, it holds that α|CHANNEL_{i,j} is a stable
execution fragment for CHANNEL_{i,j}.
Figure 3-2: The communication system S_CHA

Finally we provide the definition of "nice" execution fragment that captures the 
conditions under which PAXOS satisfies termination. 

Definition 3.3.4 Given a distributed system S, we say that an execution fragment
α of S is "nice" if α is a stable execution fragment and a majority of the processes
are alive in α.

The above definition requires a majority of processes to be alive. As will be
explained in Chapter 6, the property of majorities needed by the PAXOS algorithm
is that any two majorities have one element in common. Hence any quorum scheme
with this intersection property could be used.
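
The intersection property can be checked mechanically on small examples. The following is a minimal sketch; the function names majorities and pairwise_intersecting are hypothetical and processes are assumed to be identified by integers.

```python
from itertools import combinations


def majorities(processes):
    """Yield every subset containing more than half of the processes."""
    procs = sorted(processes)
    n = len(procs)
    for size in range(n // 2 + 1, n + 1):
        for q in combinations(procs, size):
            yield frozenset(q)


def pairwise_intersecting(quorums):
    """True if any two quorums in the collection share an element."""
    qs = list(quorums)
    return all(a & b for a, b in combinations(qs, 2))
```

For instance, pairwise_intersecting(majorities({1, 2, 3, 4, 5})) holds, while the two disjoint sets {1, 2} and {3, 4} fail the check; any family of quorums that passes it has the property that the majority requirement is used for.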

In the rest of the thesis, we will use the word "system" to mean "distributed 
system". 
Chapter 4



The consensus problem 



Several variations of the consensus problem have been studied. These variations
depend on the model used. In this chapter we provide a formal definition of the
consensus problem that we consider.

4.1 Overview 

In a distributed system processes need to cooperate and fundamental to such cooper- 
ation is the problem of reaching agreement on some data upon which the computation 
depends. Well known practical examples of agreement problems arise in distributed 
databases, where data managers need to agree on whether to commit or abort a 
given transaction, and flight control systems, where the airplane control system and 
the flight surface control system need to agree on whether to continue or abort a 
landing in progress. 

In the absence of failures, achieving agreement is a trivial task. An exchange of 
information and a common rule to make a decision is enough. However the problem 
becomes much more complex in the presence of failures. 

Several different but related agreement problems have been considered in the 
literature. All have in common that processes start the computation with initial 
values and at the end of the computation each process must reach a decision. The 
variations mostly concern stronger or weaker requirements that the solution to the 
problem has to satisfy. The requirements that a solution to the problem has to satisfy
are captured by three properties, usually called agreement, validity and termination.

As an example, the agreement condition may state that no two processes decide 
on different values, the validity condition may state that if all the initial values are 
equal then the (unique) decision must be equal to the initial value and the termination 
condition may state that every process must decide. A weaker agreement condition 
may require that only non-faulty processes agree on the decision (this weaker condition 
is necessary, for example, when considering Byzantine failures for which the behavior 
of a faulty process is unconstrained). A stronger validity condition may state that 
every decision must be equal to some initial value. 

It is clear that the definition of the consensus problem must take into account the 
distributed setting in which the problem is considered. 

About synchrony, several model variations, ranging from the completely asyn- 
chronous setting to the completely synchronous one, can be considered. A completely 
asynchronous model is one with no concept of real time. It is assumed that messages 
are eventually delivered and processes eventually respond, but it may take arbitrarily 
long. In a completely synchronous model the computation proceeds in a sequence 
of steps¹. At each step processes receive messages sent in the previous step, perform
some computation and send messages. Steps are taken at regular intervals of
time. Thus in a completely synchronous model, processes act as in a single syn- 
chronous computer. Between the two extremes of complete synchrony and complete 
asynchrony, other models with partial synchrony can be considered. These models 
assume upper bounds on the message transmission time and on the process response 
time. These upper bounds may be known or unknown to the processes. Moreover 
processes have some form of real-time clock to take advantage of the time bounds. 

¹ Usually these steps are called "rounds". However, in this thesis we use the word "round" with a
different meaning.

Failures may concern both communication channels and processes. In synchronous
and partially synchronous models, timing failures are considered. Communication
failures can result in loss of messages. Duplication and reordering of messages may
be considered failures, too. Models in which incorrect messages may be delivered
are seldom considered since there are many techniques to detect the alteration of 
a message. The weakest assumption made about process failures is that a faulty 
process has an unrestricted behavior. Such a failure is called a Byzantine failure. 
Byzantine failures are often considered with authentication; authentication provides 
a way to sign messages, so that even a Byzantine-faulty process cannot send a message
with the signature of another process. More restrictive models permit only omission 
failures, in which a faulty process fails to send some messages. The most restrictive 
models allow only stopping failures, in which a failed process simply stops and takes 
no further actions. Some models assume that failed processes can be restarted. Often 
it is assumed that there is some form of stable storage that is not affected by a 
stopping failure; a stopped process is restarted with its stable storage in the same 
state as before the failure and with every other part of its state restored to some initial 
values. In synchronous and partially synchronous models messages are supposed to 
be delivered and processes are expected to act within some time bounds. A timing 
failure is a violation of those time bounds. 

Real distributed systems are often partially synchronous systems subject to process
and channel failures. Though timely responses can be provided in real distributed
systems, the possibility of process and channel failures makes it impossible to guarantee
that timing assumptions are always satisfied. Thus real distributed systems suffer
timing failures, too. The possibility of timing failures in a partially synchronous
distributed system means that the system may as well behave like an asynchronous one.
Unfortunately, reaching consensus in asynchronous systems is impossible unless it
is guaranteed that no failures happen [18]. Hence, to solve the problem we need
to rely on the timing assumptions. Since timing failures are possible anyway, safety
properties, that is, the agreement and validity conditions, must not depend at all on
timing assumptions. However, we can rely on the timing assumptions for the termination
condition.
4.2 Formal definition

In Chapter 3 we have described the distributed setting we consider in this thesis. In
summary, we consider a partially synchronous system of n processes in a complete
network; processes are subject to stop failures and recoveries and have stable storage;
channels can lose, duplicate and reorder messages; timing failures are also possible.

Next we give a formal definition of the consensus problem we consider. 

For each process i there is an external agent that provides an initial value v
by means of an action Init(v)_i.² We denote by V the set of possible initial values
and, given a particular execution α, we denote by V_α the subset of V consisting of
those values actually used as initial values in α, that is, those values provided by
Init(v)_i actions executed in α. A process outputs a decision v by executing an action
Decide(v)_i. If a process i executes action Decide(v)_i more than once then the output
value v must be the same.

To solve the consensus problem means to give a distributed algorithm that, for
any execution α of the system, satisfies

• Agreement: All the Decide(v) actions in α have the same v.

• Validity: For any Decide(v) action in α, v belongs to V_α.

and, for any admissible execution α, satisfies

• Termination: If α = βγ and γ is a nice execution fragment and for each
process i alive in γ an Init(v)_i action occurs in α, then any process i alive in γ
executes a Decide(v)_i action in α.

The agreement and termination conditions require, as one can expect, that correct 
processes "agree" on a particular value. The validity condition is needed to relate 
the output value to the input values (otherwise a trivial solution, i.e. always output 
a default value, exists). 
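
The two safety conditions can be phrased as a check over a finite trace of actions. The sketch below is only an illustration under stated assumptions: a trace is represented as a list of hypothetical triples (action, process, value), with action equal to "init" or "decide", and the function name check_safety is not part of the formal development. Termination is not checked, since it depends on the timing behavior of the execution.

```python
def check_safety(trace):
    """Return (agreement, validity) for a finite trace of Init/Decide actions.

    trace: list of (action, process, value) triples, where action is
    "init" or "decide" (an assumed encoding of Init(v)_i and Decide(v)_i).
    """
    # Values actually provided by Init actions anywhere in the trace.
    initial_values = {v for action, _, v in trace if action == "init"}
    decided = [v for action, _, v in trace if action == "decide"]

    agreement = len(set(decided)) <= 1                      # one decided value
    validity = all(v in initial_values for v in decided)    # decided values were proposed
    return agreement, validity
```

A trace in which processes 1 and 2 are initialized with values "a" and "b" and both decide "b" satisfies both conditions; a trace in which some process decides a default value that was never provided by an Init action violates validity.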



² We remark that usually it is assumed that for each process i the Init(v)_i action is executed at
most once; however we do not need this assumption.
4.3 Bibliographic notes

PAXOS solves the consensus problem in a partially synchronous distributed system,
achieving termination when the system executes a nice execution fragment. Allowing
timing failures, the partially synchronous system may behave as an asynchronous
one. A fundamental theoretical result, proved by Fischer, Lynch and Paterson [18],
states that in an asynchronous system there is no consensus algorithm even in the
presence of only one stopping failure. Essentially, the impossibility result stems from
the inherent difficulty of determining whether a process has actually stopped or is
only slow.

The PAXOS algorithm was devised by Lamport. In the original paper [29], the 
PAXOS algorithm is described as the result of discoveries of archaeological studies of 
an ancient Greek civilization. The PAXOS algorithm is presented by explaining how 
the parliament of this ancient Greek civilization worked. A proof of correctness is 
provided in the appendix of that paper. A time-performance analysis is discussed. 
Many practical optimizations of the algorithm are also discussed. The paper [29]
also presents a variation of PAXOS that considers multiple concurrent runs of PAXOS
when consensus has to be reached on a sequence of values. We call this variation the
MULTIPAXOS algorithm.

MULTIPAXOS can be easily used to implement a data replication algorithm. In 
[34] a data replication algorithm is provided. It incorporates ideas similar to the ones 
used in PAXOS. 

In the class notes of Principles of Computer Systems [31] taught at MIT, a de- 
scription of PAXOS is provided using a specification language called SPEC. The pre- 
sentation in [31] contains the description of how a round of PAXOS is conducted. The 
leader election problem is not considered. Timing issues are not considered; for ex- 
ample, the problem of starting new rounds is not addressed. A proof of correctness, 
written also in SPEC, is provided. Our presentation differs from that of [31] in the 
following aspects: it uses the I/O automata models; it provides all the details of the 
algorithm; it provides a modular description of the algorithm, including auxiliary 
parts such as a failure detector module and a leader elector module; along with the
proof of correctness, it provides a performance and fault-tolerance analysis. In [32] 
Lampson provides a brief overview of the PAXOS algorithm together with the key 
points for proving the correctness of the algorithm. 

In [11] three different partially synchronous models are considered. For each of 
them and for different types of failure an upper bound on the number of failures 
that can be tolerated is shown, and algorithms that achieve the bounds are given. A 
model studied in [11] considers a distributed setting similar to the one we consider in 
this thesis: a partially synchronous distributed system in which upper bounds on the 
process response time and message delivery time hold eventually; the failures con- 
sidered are process stop failures (also other models that consider omission failures, 
Byzantine failures with and without authentication are studied in [11]). The proto- 
col provided in [11], the DLS algorithm for short, needs a linear, in the number of 
processes, amount of time from the point in which the upper bounds on the process 
response time and message delivery time start holding. This is similar to the PAXOS 
performance which requires a linear amount of time to achieve termination when the 
system executes a nice execution fragment. However the DLS algorithm does not 
consider process recoveries and it is resilient to a number of process stopping failures
that is less than half the number of processes. This can be related to PAXOS
by the fact that PAXOS requires a majority of processes alive to reach termination. 
The PAXOS algorithm is resilient also to channel failures while the DLS algorithm 
does not consider channel failures. 

PAXOS bears some similarities with the standard three-phase commit protocol:
both require, in each round, an exchange of 5 messages. However the standard commit
protocol requires a reliable leader elector while PAXOS does not. Moreover PAXOS
sends information on the value to agree on only in the third message of a round (while
the commit protocol sends it in the first message) and because of this, MULTIPAXOS
can exchange the first two messages only once for many instances and use only the
exchange of the last three messages for each individual consensus problem.
Chapter 5



Failure detector and leader elector 



In this chapter we provide a failure detector algorithm and then we use it to implement 
a leader election algorithm, which in turn will be used in Chapter 6 to implement 
PAXOS. The failure detector and the leader elector we implement here are both sloppy, 
meaning that they are guaranteed to give accurate information on the system only in 
a stable execution. However, this is enough for implementing PAXOS. 

5.1 A failure detector 

In this section we provide an automaton that detects process failures and recoveries 
and we prove that the automaton satisfies certain properties that we will need in 
the rest of the thesis. We do not provide a formal definition of the failure detection
problem; roughly speaking, however, it is the problem of
checking which processes are alive and which ones are stopped.

Without some knowledge of the passage of time it is not possible to detect failures;
thus to implement a failure detector we need to rely on timing assumptions. Figure
5-1 shows a Clock GT automaton, called DETECTOR(z, c)_i. In our setting failures and
recoveries are modeled by means of actions Stop_i and Recover_i. These two actions are
input actions of DETECTOR(z, c)_i. Moreover DETECTOR(z, c)_i has InformStopped(j)_i
and InformAlive(j)_i as output actions, which are executed when, respectively, the
stopping and the recovering of process j are detected. Automaton DETECTOR(z, c)_i
DETECTOR(z, c)_i

Signature:

    Input:         Receive(m)_{j,i}, Stop_i, Recover_i
    Internal:      Check(j)_i
    Output:        InformStopped(j)_i, InformAlive(j)_i, Send(m)_{i,j}
    Time-passage:  ν(t)

State:

    Clock ∈ ℝ                      initially arbitrary
    Status ∈ {alive, stopped}      initially alive
    Alive ∈ 2^ℐ                    initially ℐ
    for all j ∈ ℐ:
        Prevrec(j) ∈ ℝ^{≥0}        initially arbitrary
        Lastinform(j) ∈ ℝ^{≥0}     initially Clock
        Lastsend(j) ∈ ℝ^{≥0}       initially Clock
        Lastcheck(j) ∈ ℝ^{≥0}      initially Clock

Actions:

    input Stop_i
        Eff: Status := stopped

    input Recover_i
        Eff: Status := alive

    output Send("Alive")_{i,j}
        Pre: Status = alive
        Eff: Lastsend(j) := Clock + z

    input Receive("Alive")_{j,i}
        Eff: if Status = alive then
                 Prevrec(j) := Clock
                 if j ∉ Alive then
                     Alive := Alive ∪ {j}
                     Lastcheck(j) := Clock + c

    internal Check(j)_i
        Pre: Status = alive
             j ∈ Alive
        Eff: Lastcheck(j) := Clock + c
             if Clock > Prevrec(j) + z + d then
                 Alive := Alive \ {j}

    output InformStopped(j)_i
        Pre: Status = alive
             j ∉ Alive
        Eff: Lastinform(j) := Clock + ℓ

    output InformAlive(j)_i
        Pre: Status = alive
             j ∈ Alive
        Eff: Lastinform(j) := Clock + ℓ

    time-passage ν(t)
        Pre: Status = alive
        Eff: Let t' be s.t.
                 ∀j, Clock + t' ≤ Lastinform(j)
                 ∀j, Clock + t' ≤ Lastsend(j)
                 ∀j, Clock + t' ≤ Lastcheck(j)
             Clock := Clock + t'

Figure 5-1: Automaton DETECTOR for process i
works by having each process constantly send "Alive" messages to each other
process and check that such messages are received from other processes. It sends
at least one "Alive" message in an interval of time of a fixed length z (i.e., if an
"Alive" message is sent at time t then the next one is sent before time t + z) and
checks for incoming messages at least once in an interval of time of a fixed length
c. Let us denote by S_DET the system consisting of system S_CHA and an automaton
DETECTOR(z, c)_i for each process i ∈ ℐ. Figure 5-2 shows S_DET.
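
The receiving side of this strategy can be sketched in a few lines. This is a simplified illustration and not the automaton of Figure 5-1: the class Detector and its method names are hypothetical, time is passed in explicitly instead of being advanced by time-passage actions, Prevrec is initialized to the current time (the automaton leaves it arbitrary), and the Stop/Recover status and the Inform actions are omitted.

```python
class Detector:
    """Simplified heartbeat-based view that process i keeps of its peers."""

    def __init__(self, peers, z, d, now=0.0):
        self.z = z                                # bound between two "Alive" sends
        self.d = d                                # message delivery bound
        self.alive = set(peers)                   # Alive, initially all processes
        self.prevrec = {p: now for p in peers}    # assumption: start at `now`

    def on_receive_alive(self, j, now):
        # An "Alive" message from j refreshes Prevrec(j) and re-adds j.
        self.prevrec[j] = now
        self.alive.add(j)

    def check(self, j, now):
        # j is suspected once more than z + d has passed since its last message.
        if j in self.alive and now > self.prevrec[j] + self.z + self.d:
            self.alive.discard(j)

    def status(self, j):
        return "alive" if j in self.alive else "stopped"
```

With z = 1.0 and d = 0.5, a peer whose last message arrived at time 0.25 is still considered alive by a check at time 1.75 and is suspected by a check at time 2.0; a later "Alive" message puts it back into the set, which is why the detector can be wrong when messages are lost or delayed.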



Figure 5-2: The system S_DET



Lemma 5.1.1 If an execution fragment α of S_DET, starting in a reachable state and
lasting for more than z + c + ℓ + 2d time, is stable and process i is stopped in α, then
by time z + c + ℓ + 2d, for each process j alive in α, an action InformStopped(i)_j is
executed and no subsequent InformAlive(i)_j action is executed in α.



Proof: Let j be any alive process, and let t' be the Clock_j value of process j at
the beginning of α. Notice that, since α is stable, at time Δ in α we have that
Clock_j = t' + Δ. Now, notice that CHANNEL_{i,j} is a subsystem of S_DET, that is,
S_DET is the composition of CHANNEL_{i,j} and other automata. By Theorem 2.6.10
the projection α|CHANNEL_{i,j} is an execution fragment of CHANNEL_{i,j} and thus any
property true for CHANNEL_{i,j} in α|CHANNEL_{i,j} is true for S_DET in α; in particular
we can use Lemma 3.2.3. Since α is stable and starts in a reachable state we have
that α|CHANNEL_{i,j} is stable and starts in a reachable state. Thus by Lemma 3.2.3,
any message from i to j that is in the channel at the beginning of α is delivered by
time d and consequently, since process i is stopped in α, no message from process i
is received by process j after time d. Let s be the first state of α after which no
further messages from i are received by j and let π_s be the action that brings the
system into state s. Notice that the time of occurrence of π_s is before or at time d.
We distinguish two possible cases.

CASE 1: Process i ∉ s.Alive_j. Then, by the code of DETECTOR(z, c)_j, an action
InformStopped(i)_j is executed within ℓ time after s. Clearly action InformStopped(i)_j
is executed after s and, since the time of occurrence of π_s is ≤ d, it is executed
before or at time d + ℓ. Moreover, since no messages from i are received after s, no
InformAlive(i)_j can happen later on. Thus the lemma is proved in this case.

CASE 2: Process i ∈ s.Alive_j. Let Prevrec be the value of Clock_j at the moment
when the last "Alive" message from i is received by j. Since no message from
process i is received by process j after s and the time of occurrence of π_s is ≤ d,
we have that Prevrec ≤ t' + d; indeed, as we observed before, at time Δ in α we
have that Clock_j = t' + Δ, for any Δ. Since process i is supposed to send a new
"Alive" message within z time from the previous one and the message may take up to
d time to be delivered, a new "Alive" message from process i is expected by process
j before Clock_j passes the value Prevrec + z + d. However, no messages are received
when Clock_j > Prevrec. By the code of DETECTOR(z, c)_j an action Check(i)_j occurs
after time Prevrec + z + d and before or at time Prevrec + z + c + d; indeed, a check
action occurs at least once in an interval of time of length c. When this action occurs,
since Clock_j > Prevrec + z + d, it removes process i from the Alive_j set (see code).
Thus by time Prevrec + z + c + d process i is not in Alive_j. Since Prevrec ≤ t' + d,
we have that process i is not in Alive_j before Clock_j passes t' + z + c + 2d. Action
InformStopped(i)_j is executed within additional ℓ time, that is, before Clock_j passes
t' + z + c + 2d + ℓ. Notice also (and we will need this for the second part of the
proof) that this action happens when Clock_j > t' + z + c + 2d > t' + d. Thus we
have that action InformStopped(i)_j is executed by time z + c + 2d + ℓ. Since we are
considering the case when process i is in Alive_j at time d, action InformStopped(i)_j is
executed after time d. This is true for any alive process j. Thus the lemma is proved
also in this case.

This concludes the proof of the first part of the lemma. Now, since no messages
from i are received by j after time d, that is, no message from i to j is received
when Clock_j > t' + d and, by the first part of this proof, InformStopped(i)_j happens
when Clock_j > t' + d, we have that no InformAlive(i)_j action can occur after
InformStopped(i)_j has occurred. This is true for any alive process j. Thus also the
second part of the lemma is proved. ∎
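
For reference, the chain of bounds used in Case 2 of the proof can be collected in one place. The display below only restates inequalities from the proof; the shorthand T_rem (value of Clock_j when i is removed from Alive_j) and T_inf (value of Clock_j when InformStopped(i)_j is executed) is introduced here for illustration and does not appear elsewhere.

```latex
\begin{align*}
\mathit{Prevrec} &\le t' + d
  && \text{(no message from $i$ is received after time $d$)}\\
T_{\mathrm{rem}} &\le \mathit{Prevrec} + z + c + d \;\le\; t' + z + c + 2d
  && \text{(a Check occurs at least once every $c$)}\\
T_{\mathrm{inf}} &\le T_{\mathrm{rem}} + \ell \;\le\; t' + z + c + 2d + \ell
  && \text{(an Inform action occurs within $\ell$)}
\end{align*}
```

Since Clock_j = t' + Δ at time Δ in a stable fragment, the last line corresponds to time z + c + ℓ + 2d, which is the bound stated in the lemma and dominates the bound d + ℓ obtained in Case 1.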

Lemma 5.1.2 If an execution fragment α of S_DET, starting in a reachable state and
lasting for more than z + d + ℓ time, is stable and process i is alive in α, then by time
z + d + ℓ, for each process j alive in α, an action InformAlive(i)_j is executed and no
subsequent InformStopped(i)_j action is executed in α.

Proof: Let j be any alive process, and let t' be the value of Clock_i and t'' be the value
of Clock_j at the beginning of α. Notice that, since α is stable, at time Δ in α we have
that Clock_i = t' + Δ and Clock_j = t'' + Δ. Now, notice that CHANNEL_{i,j} is a subsystem
of S_DET, that is, S_DET can be thought of as the composition of CHANNEL_{i,j} and other
automata. By Theorem 2.6.10, α|CHANNEL_{i,j} is an execution fragment of CHANNEL_{i,j} and thus
any property of CHANNEL_{i,j} true in α|CHANNEL_{i,j} is true for S_DET in α; in particular
we can use Lemma 3.2.3. Since process i is alive in α and α is stable, process i sends
an "Alive" message to process j by time z and, by Lemma 3.2.3, such a message is
received by process j by time z + d. Hence, before Clock_j passes t'' + z + d, action
Receive("Alive")_{i,j} is executed and thus process i is put into Alive_j (unless it was
already there). Once process i is in Alive_j, within additional ℓ time, that is, before
Clock_j passes t'' + z + d + ℓ, or equivalently, by time z + d + ℓ, action InformAlive(i)_j
is executed. This is true for any process j. This proves the first part of the lemma.

Let t be the time of occurrence of the first Receive("Alive")_{i,j} executed in α; by
the first part of this lemma, t ≤ z + d. Then since α is stable, process i sends at least
one "Alive" message in an interval of time z and each message takes at most d to be
delivered. Thus in any interval of time z + d process j executes a Receive("Alive")_{i,j}.
This implies that the Clock_j variable of process j never assumes values greater than
Prevrec(i)_j + z + d, which in turn implies that no Check(i)_j action removes
process i from Alive_j. Notice that process i may be removed from Alive_j before time
t. However it is put into Alive_j at time t and it is not removed later on. Thus also
the second part of the lemma is proved. ∎

The strategy used by DETECTOR(z, c)_i is a straightforward one. For this reason
it is very easy to implement. However the failure detector so obtained is not reliable,
i.e., it does not give accurate information, in the presence of failures (Stop_i, Lose_{i,j},
irregular executions). For example, it may consider a process stopped just because the
"Alive" message of that process was lost in the channel. Automaton DETECTOR(z, c)_i
is guaranteed to provide accurate information on faulty and alive processes only when
the system is stable.

In the rest of this thesis we assume that z = ℓ and c = ℓ, that is, we use
DETECTOR(ℓ, ℓ)_i. This particular strategy consists of sending an "Alive" message
in each interval of ℓ time (i.e., we assume z = ℓ) and of checking for incoming
messages at least once in each interval of ℓ time (i.e., we assume c = ℓ). In practice
the choice of z and c may be different. However from a theoretical point of view
such a choice is irrelevant as it only affects the running time by a constant factor.
With z = c = ℓ, Lemmas 5.1.1 and 5.1.2 can be restated as follows.

Lemma 5.1.3 If an execution fragment α of S_DET, starting in a reachable state and
lasting for more than 3ℓ + 2d time, is stable and process i is stopped in α, then by
time 3ℓ + 2d, for each process j alive in α, an action InformStopped(i)_j is executed
and no subsequent InformAlive(i)_j action is executed in α.

Lemma 5.1.4 If an execution fragment α of S_DET, starting in a reachable state and
lasting for more than d + 2ℓ time, is stable and process i is alive in α, then by time
d + 2ℓ, for each process j alive in α, an action InformAlive(i)_j is executed and no
subsequent InformStopped(i)_j action is executed in α.
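
The substitution behind these restatements is simple arithmetic, and a small helper makes it easy to try other parameter choices. The function name detector_bounds and its parameter names are hypothetical; the two formulas are the bounds of Lemmas 5.1.1 and 5.1.2.

```python
def detector_bounds(z, c, ell, d):
    """Detection bounds for DETECTOR(z, c) in a stable execution fragment.

    Returns (stopped_bound, alive_bound):
      stopped_bound = z + c + ell + 2d   (bound of Lemma 5.1.1)
      alive_bound   = z + d + ell        (bound of Lemma 5.1.2)
    """
    stopped_bound = z + c + ell + 2 * d
    alive_bound = z + d + ell
    return stopped_bound, alive_bound
```

With z = c = ℓ the two values become 3ℓ + 2d and 2ℓ + d; for example, ℓ = 2 and d = 5 give 16 and 9.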

5.2 A leader elector 

Electing a leader in an asynchronous distributed system is a difficult task. An
informal argument that explains this difficulty is that the leader election problem is
somewhat similar to the consensus problem (which, in an asynchronous system subject
to failures, is unsolvable [18]) in the sense that to elect a leader all processes
must reach consensus on which one is the leader. As for the failure detector, we need
to rely on timing assumptions. It is fairly clear how a failure detector can be used
to elect a leader. Indeed the failure detector gives information on which processes
are alive and which ones are not alive. This information can be used to elect the
current leader. We use the DETECTOR(ℓ, ℓ)_i automaton to check for the set of alive
processes. Figure 5-3 shows automaton LEADERELECTOR_i, which is an MMT automaton.
Remember that we use MMT automata to describe in a simpler way Clock GT
automata. Automaton LEADERELECTOR_i interacts with DETECTOR(ℓ, ℓ)_i by means
of actions InformStopped(j)_i, which inform process i that process j has stopped, and
InformAlive(j)_i, which inform process i that process j has recovered. Each process
updates its view of the set of alive processes when these two actions are executed. The
process with the biggest identifier in the set of alive processes is declared leader. We
denote with S_LEA the system consisting of S_DET composed with a LEADERELECTOR_i
automaton for each process i ∈ ℐ. Figure 5-4 shows S_LEA.

Since DETECTOR(ℓ, ℓ)_i is not a reliable failure detector, LEADERELECTOR_i is also
not reliable. Thus, it is possible that processes have different views of the system so
that more than one process considers itself leader, or the process supposed to be the
leader is actually stopped. However, as the failure detector becomes reliable when the
system S_DET executes a stable execution fragment (see Lemmas 5.1.3 and 5.1.4), also
the leader elector becomes reliable when system S_LEA is stable. Notice that when
S_LEA executes a stable execution fragment, so does S_DET.
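
The election rule itself (keep a pool of processes believed alive and take the largest identifier) can be sketched as follows. This is an illustration only: the class LeaderElector and its method names are hypothetical, process identifiers are assumed to be comparable integers, timing is ignored, and the guard that keeps the previous leader when the pool is empty is an added assumption, since taking the maximum of an empty pool is undefined.

```python
class LeaderElector:
    """Sloppy leader election for process i from failure-detector reports."""

    def __init__(self, i):
        self.i = i
        self.status = "alive"
        self.pool = {i}      # Pool, initially {i}
        self.leader = i      # Leader, initially i

    def _update_leader(self):
        # Assumption: keep the previous leader if the pool becomes empty.
        if self.pool:
            self.leader = max(self.pool)

    def inform_stopped(self, j):
        if self.status == "alive":
            self.pool.discard(j)
            self._update_leader()

    def inform_alive(self, j):
        if self.status == "alive":
            self.pool.add(j)
            self._update_leader()

    def is_leader(self):
        # Corresponds to the precondition of the Leader output action.
        return self.status == "alive" and self.leader == self.i
```

Two electors that receive different reports can both answer is_leader() with True, which is exactly the unreliability discussed above; once every elector has been informed of the same set of alive processes they all compute the same maximum.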
LEADERELECTOR_i

Signature:

    Input:   InformStopped(j)_i, InformAlive(j)_i, Stop_i, Recover_i
    Output:  Leader_i, NotLeader_i

State:

    Status ∈ {alive, stopped}      initially alive
    Pool ∈ 2^ℐ                     initially {i}
    Leader ∈ ℐ                     initially i

Actions:

    input Stop_i
        Eff: Status := stopped

    input Recover_i
        Eff: Status := alive

    output Leader_i
        Pre: Status = alive
             i = Leader
        Eff: none

    output NotLeader_i
        Pre: Status = alive
             i ≠ Leader
        Eff: none

    input InformStopped(j)_i
        Eff: if Status = alive then
                 Pool := Pool \ {j}
                 Leader := max of Pool

    input InformAlive(j)_i
        Eff: if Status = alive then
                 Pool := Pool ∪ {j}
                 Leader := max of Pool

Tasks and bounds:

    {Leader_i, NotLeader_i}, bounds [0, ℓ]

Figure 5-3: Automaton LEADERELECTOR for process i
Figure 5-4: The system S_LEA

Formally we consider a process i to be leader if Leader_i = i. That is, a process i is
leader if it considers itself to be the leader. This allows multiple or no leaders and does
not require other processes to be aware of the leader or the leaders. The following
definition gives a much more precise notion of leader.

Definition 5.2.1 In a state s, there is a unique leader if and only if there exists an
alive process i such that s.Leader_i = i and for all other alive processes j ≠ i it holds
that s.Leader_j = i.

The next lemma states that in a stable execution fragment, eventually there will be a
unique leader.



Lemma 5.2.2 If an execution fragment α of S_LEA, starting in a reachable state and
lasting for more than 4ℓ + 2d, is stable, then by time 4ℓ + 2d there is a state occurrence
s such that in state s and in all the states after s there is a unique leader. Moreover,
this unique leader is always the process with the biggest identifier among the processes
alive in α.

Proof: First notice that the system S_LEA consists of system S_DET composed with
other automata. Hence, by Theorem 2.6.10, we can use any property of S_DET. In
particular we can use Lemmas 5.1.3 and 5.1.4, and thus by time 3ℓ + 2d
each process has a consistent view of the set of alive and stopped processes. Let i be
the leader. Since α is stable and thus also regular, by Lemma 2.5.4, within an additional
ℓ time, actions Leader_j and NotLeader_j are consistently executed for each process j,
including process j = i. The fact that i is the process with the biggest identifier
among the processes alive in α follows directly from the code of LEADERELECTOR_i. ∎
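
As an illustration only, the state and input actions of LEADERELECTOR_i, together with
the unique-leader condition of Definition 5.2.1, can be sketched in Python. This is a
minimal sketch, not the automaton itself: timing bounds and the output actions are
omitted, the class and function names are hypothetical, and keeping the old leader when
Pool becomes empty is an assumption (the automaton code does not say what "max of Pool"
means for an empty set).

```python
class LeaderElector:
    """Local view of process i: which processes it believes alive, and who leads."""

    def __init__(self, i):
        self.i = i
        self.status = "alive"
        self.pool = {i}       # Pool: processes currently believed alive
        self.leader = i       # Leader: initially the process itself

    def stop(self):
        self.status = "stopped"

    def recover(self):
        self.status = "alive"

    def inform_stopped(self, j):
        if self.status == "alive":
            self.pool.discard(j)
            self._elect()

    def inform_alive(self, j):
        if self.status == "alive":
            self.pool.add(j)
            self._elect()

    def _elect(self):
        # Assumption: an empty Pool leaves the previous leader unchanged.
        if self.pool:
            self.leader = max(self.pool)

    def is_leader(self):
        return self.status == "alive" and self.leader == self.i


def unique_leader(electors):
    """Definition 5.2.1: some alive i has Leader_i = i and every alive j agrees."""
    alive = [e for e in electors if e.status == "alive"]
    return any(
        e.is_leader() and all(o.leader == e.i for o in alive)
        for e in alive
    )
```

Before any information is exchanged every process considers itself leader, so
unique_leader is false; once all views agree, the process with the biggest identifier
is the unique leader.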

We remark that, for many algorithms that rely on the concept of leader, it is 
important to provide exactly one leader. For example when the leader election is 
used to generate a new token in a token ring network, it is important that there is 
exactly one process (the leader) that generates the new token, because the network 
gives the right to send messages to the owner of the token and two tokens may result 
in an interference between two communications. For these algorithms, having two or 
more leaders jeopardizes the correctness. Hence the sloppy leader elector provided
above is not suitable for them. However, for the purpose of this thesis, LEADERELECTOR_i is
all we need.

5.3 Bibliographic notes 

In an asynchronous system it is impossible to distinguish a very slow process from a 
stopped one. This is why the consensus problem cannot be solved even in the case 
where at most one process fails [18]. If a reliable failure detector were provided then 
the consensus problem would be solvable. This clearly implies that in a completely 
asynchronous setting no reliable failure detector can be provided. Chandra and Toueg 
[5] gave a definition of unreliable failure detector, and characterized failure detectors 
in terms of two properties: completeness, which requires that the failure detector
eventually suspect any stopped process, and accuracy, which restricts the mistakes a
failure detector can make. No failure detectors are actually implemented in [5]. The
failure detector provided in this thesis cannot be classified in the hierarchy defined
in [5], since Chandra and Toueg do not consider channel failures.

Chandra, Hadzilacos and Toueg [4] identified the "weakest" failure detector that 
can be used to solve the consensus problem. 

Failure detectors have practical relevance since it is often important to establish
which processes are alive and which ones are stopped. For example, in electing a leader
it is crucial to know which processes are alive and which ones are stopped. The need
for a leader in a distributed computation arises in many practical situations,
for example in a token ring network. However, in asynchronous systems there is
the inherent difficulty of distinguishing a stopped process from a slow one.

Chapter 6



The PAXOS algorithm 



PAXOS was devised a very long time ago¹ but its discovery, due to Lamport, dates
back only to 1989 [29].

In this chapter we describe the PAXOS algorithm, provide an implementation us- 
ing Clock GT automata, prove its correctness and analyze its performance. The 
performance analysis is given assuming that there are no failures nor recoveries, and 
a majority of the processes are alive for a sufficiently long time. We remark that 
when no restrictions are imposed on the possible failures, the algorithm might not 
terminate. 

6.1 Overview 

Our description of PAXOS is modular: we have separated various parts of the overall 
algorithm; each piece copes with a particular aspect of the problem. This approach 
should make the understanding of the algorithm much easier. The core part of the 
algorithm is a module that we call BASICPAXOS; this piece incorporates the basic 
ideas on which the algorithm itself is built. The description of this piece is further 
subdivided into three components, namely BPLEADER, BPAGENT and BPSUCCESS. 

In BASICPAXOS processes try to reach a decision by running what we call a
"round". A process starting a round is the leader of that round. BASICPAXOS guarantees
that, no matter how many leaders start rounds, agreement and validity are not
violated. However, to have a complete algorithm that satisfies termination when there
are no failures for a sufficiently long time, we need to augment BASICPAXOS with another
module; we call this module STARTERALG. The function of STARTERALG
is to make the current leader start a new round if the previous one is not completed
within some time bound.

¹ The most accurate information dates it back to the beginning of this millennium [29].

Leaders are elected by using the LEADERELECTOR algorithm provided in Chapter 5.
We remark that this is possible because the presence of two or more leaders
does not jeopardize agreement and validity; however, to get termination there must be a
unique leader.

Thus, our implementation of PAXOS is obtained by composing the following automata:
CHANNEL_{i,j} for the communication between processes, DETECTOR_i and
LEADERELECTOR_i for the leader election, BASICPAXOS_i and STARTERALG_i, for every
process i, j ∈ ℐ. The resulting system is called S_PAX and it is shown in Figure 6-1; we
have emphasized some of the interactions among the automata composing S_PAX and
some of the interactions with the external environment (input actions that model
channel failures are not drawn; channels are not drawn). Figure 6-2 gives a more
detailed view of the interaction among the automata composing BASICPAXOS_i.

It is worth remarking that some pieces of the algorithm do need to be able to
measure the passage of time (DETECTOR_i, STARTERALG_i and BPSUCCESS_i) while
others do not.

We will prove (Theorems 6.2.15 and 6.2.18) that the system S_PAX solves the consensus
problem ensuring partial correctness (any output is guaranteed to be correct,
that is, agreement and validity are satisfied) and (Theorem 6.4.2) that S_PAX also
guarantees termination when the system executes a nice execution fragment, that is,
one without failures and recoveries and with at least a majority of the processes being
alive.

Figure 6-1: PAXOS

Figure 6-2: BASICPAXOS (the external actions labeled in the figure are Init(v) and Decide(v))

6.2 Automaton BASICPAXOS

In this section we present the automaton BASICPAXOS which is the core part of the 
PAXOS algorithm. We begin by providing an overview of how automaton BASICPAXOS 
works, then we provide the automaton code along with a detailed description and 
finally we prove that it satisfies agreement and validity. 

6.2.1 Overview 

The basic idea, which is the heart of the algorithm, is to propose values until one of 
them is accepted by a majority of the processes; that value is the final output value. 
Any process may propose a value by initiating a round for that value. The process 
initiating a round is said to be the leader of that round while all processes, including 
the leader itself, are said to be agents for that round. Informally, the steps for a round 
are the following. 

1. To initiate a round, the leader sends a "Collect" message to all agents² announcing
that it wants to start a new round and at the same time asking for
information about previous rounds in which agents may have been involved.

2. An agent that receives a message sent in step 1 from the leader of the round
responds with a "Last" message giving its own information about rounds previously
conducted. With this, the agent makes a kind of commitment for this
particular round that may prevent it from accepting (in step 4) the value proposed
in some other round. If the agent is already committed for a round with
a bigger round number then it informs the leader of its commitment with an
"OldRound" message.

3. Once the leader has gathered information about previous rounds from a majority
of agents, it decides, according to some rules, the value to propose for its round
and sends to all agents a "Begin" message announcing the value and asking
them to accept it. In order for the leader to be able to choose a value for
the round it is necessary that initial values be provided. If no initial value is
provided, the leader must wait for an initial value before proceeding with step
3. The set of processes from which the leader gathers information is called the
info-quorum of the round.

4. An agent that receives a message from the leader of the round sent in step 3
responds with an "Accept" message by accepting the value proposed in the current
round, unless it is committed for a later round and thus must reject the
value proposed in the current round. In the latter case the agent sends an
"OldRound" message to the leader indicating the round for which it is committed.

5. If the leader gets "Accept" messages from a majority of agents, then the leader
sets its own output value to the value proposed in the round. At this point the
round is successful. The set of agents that accept the value proposed by the
leader is called the accepting-quorum.

² Thus it sends a message also to itself. This helps in that we do not have to specify different
behaviors for a process according to the fact that it is both leader and agent or just an agent. We
just need to specify the leader behavior and the agent behavior.
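
The agent side of steps 2 and 4 can be sketched as follows. This is only an
illustration under stated assumptions: round numbers are modeled as Python tuples
(counter, process identifier) compared lexicographically, messages are returned as
plain tuples instead of being placed in a channel, and the class and method names are
hypothetical. The three fields mirror the variables Commit, LastR and LastV of the
agent automaton.

```python
class Agent:
    """Agent behavior for one process: respond to "Collect" and "Begin"."""

    def __init__(self, ident):
        self.ident = ident
        self.commit = (0, ident)   # Commit: highest round the agent is committed to
        self.last_r = (0, ident)   # LastR: last round in which a value was accepted
        self.last_v = None         # LastV: value accepted in that round (None = nil)

    def on_collect(self, r):
        """Step 2: commit to round r, or report the round already committed to."""
        if r >= self.commit:
            self.commit = r
            return (r, "Last", self.last_r, self.last_v)
        return (r, "OldRound", self.commit)

    def on_begin(self, r, v):
        """Step 4: accept the value of round r unless committed to a later round."""
        if r >= self.commit:
            self.last_r, self.last_v = r, v
            return (r, "Accept")
        return (r, "OldRound", self.commit)
```

For example, an agent that has answered the "Collect" of round (2, 1) rejects a late
"Begin" of round (1, 2) with an "OldRound" message carrying (2, 1).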

Since a successful round implies that the leader of the round reached a decision, 
after a successful round the leader still needs to do something, namely to broadcast 
the reached decision. Thus, once the leader has made a decision it broadcasts a 
"Success" message announcing the value for which it has decided. An agent that 
receives a "Success" message from the leader makes its decision choosing the value 
of the successful round. We also use an "Ack" message sent from the agent to the
leader, so that the leader can make sure that everyone knows the outcome.

Figure 6-3 shows: (a) the steps of a round r; (b) the response from an agent that
informs the leader that a higher-numbered round r' has already been initiated; (c)
the broadcast of a decision. The parameters used in the messages will be explained
later. Section 6.2.2 contains a description of the messages.

Figure 6-3: Exchange of messages
    (a) Leader to Voter: (r, "Collect"); Voter to Leader: (r, "Last", r', v);
        Leader to Voter: (r, "Begin", v); Voter to Leader: (r, "Accept")
    (b) Leader to Voter: (r, "Collect"); Voter to Leader: (r, "OldRound", r');
        Leader to Voter: (r, "Begin", v); Voter to Leader: (r, "OldRound", r')
    (c) Leader to Voter: ("Success", v); Voter to Leader: ("Ack")

Since different rounds may be carried out concurrently (several processes may
concurrently initiate rounds), we need to distinguish them. Every round has a unique
identifier. Next we formally define these round identifiers. A round number is a pair
(x, i) where x is a nonnegative integer and i is a process identifier. The set of round
numbers is denoted by ℛ. A total order on elements of ℛ is defined by (x, i) < (y, j)
iff x < y or, x = y and i < j.

Definition 6.2.1 Round r "precedes" round r' if r < r'.

If round r precedes round r' then we also say that r is a previous round with
respect to round r'. We remark that the ordering of rounds is not related to the
actual time the rounds are conducted. It is possible that a round r' is started at some
point in time and a previous round r, that is, one with r < r', is started later on.

For each process i, we define a "+_i" operation that, given a round number (x, j)
and an integer y, returns the round number (x, j) +_i y = (x + y, i).

Every round in the algorithm is tagged with a unique round number. Every
message sent by the leader or by an agent for a round (with round number) r ∈ ℛ
carries the round number r so that no confusion among messages belonging to different
rounds is possible.
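
As a small illustration, round numbers and the "+_i" operation can be modeled with
Python tuples, whose built-in lexicographic comparison coincides with the total order
defined above. The function name plus and the use of single letters as process
identifiers are assumptions made only for this sketch.

```python
def plus(i, rnd, y):
    """The "+_i" operation: (x, j) +_i y = (x + y, i)."""
    x, _j = rnd
    return (x + y, i)


# Tuples compare first on the counter x, then on the process identifier.
rounds = [(3, "B"), (1, "B"), (2, "D"), (3, "A"), (2, "A")]
ordered = sorted(rounds)
```

Note that the result of "+_i" always carries the identifier i of the process applying
it, so two different processes can never produce the same round number.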

However the most important issue is about the values that leaders propose for 
their rounds. Indeed, since the value of a successful round is the output value of 
some processes, we must guarantee that the values of successful rounds are all equal 
in order to satisfy the agreement condition of the consensus problem. This is the 
tricky part of the algorithm and basically all the difficulties derive from solving this 
problem. Consistency is guaranteed by choosing the values of new rounds exploiting 
the information about previous rounds from at least a majority of the agents so that, 
for any two rounds there is at least one process that participated in both rounds. 

In more detail, the leader of a round chooses the value for the round in the following 
way. In step 1, the leader asks for information and in step 2 an agent responds with 
the number of the latest round in which it accepted the value and with the accepted 
value or with round number (0, j) and nil if the agent has not yet accepted a value. 
Once the leader gets such information from a majority of the agents (which is the 
info-quorum of the round), it chooses the value for its round to be equal to the value 
of the latest round among all those it has heard from the agents in the info-quorum 
or equal to its initial value if all agents in the info-quorum were not involved in any 
previous round. Moreover, in order to keep consistency, if an agent tells the leader of 
a round r that the last round in which it accepted a value is round r', r' < r, then 
implicitly the agent commits itself not to accept any value proposed in any other 
round r", r' < r" < r. 
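
The value-selection rule just described can be sketched as a small function. It is a
simplified illustration, not the automaton code: it assumes that the caller passes the
"Last" replies of a majority of agents, each reduced to a pair (r', v) with v set to
None when the agent has not accepted any value, and that round numbers are comparable
tuples. The name choose_value is hypothetical.

```python
def choose_value(last_replies, init_value):
    """Pick the value for a new round from the info-quorum's "Last" replies.

    last_replies: iterable of (r_prime, v) pairs, one per agent in the info-quorum.
    Returns the value of the latest round reported, or init_value if no agent in
    the info-quorum has accepted a value yet.
    """
    best_round, best_value = None, None
    for r_prime, v in last_replies:
        if v is not None and (best_round is None or r_prime > best_round):
            best_round, best_value = r_prime, v
    return best_value if best_value is not None else init_value
```

For instance, with replies ((2, "A"), "vA"), ((2, "D"), "vB") and ((2, "D"), "vB") the
function returns "vB", because (2, "D") is the latest round reported.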

Given the above setting, if r' is the round from which the leader of round r gets
the value for its round, then, when a value for round r has been chosen, any round
r", r' < r" < r, cannot be successful; indeed at least a majority of the processes
are committed for round r, which implies that at least a majority of the processes
are rejecting round r". This, along with the fact that info-quorums and
accepting-quorums are majorities, implies that if a round r is successful, then any round with
a bigger round number r' > r is for the same value. Indeed, the information sent by
processes in the info-quorum of round r' is used to choose the value for the round,
but since info-quorums and accepting-quorums share at least one process, at least
one of the processes in the info-quorum of round r' is also in the accepting-quorum
of round r (since round r is successful, its accepting-quorum is a majority).
This implies that the value of any round r' > r must be equal to the value of round
r, which, in turn, implies agreement.

We remark that instead of majorities for info-quorums and accepting-quorums, 
any quorum system can be used. Indeed the only property that is required is that 
there is always a process in the intersection of any info-quorum with any accepting- 
quorum. 
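
The intersection property can be checked mechanically for small systems. The sketch
below is illustrative only; the helper names majorities and always_intersect are
hypothetical. It enumerates all majorities of a process set and verifies that every
info-quorum meets every accepting-quorum.

```python
from itertools import combinations


def majorities(processes):
    """All subsets containing more than half of the processes."""
    n = len(processes)
    return [
        set(q)
        for k in range(n // 2 + 1, n + 1)
        for q in combinations(processes, k)
    ]


def always_intersect(info_quorums, accepting_quorums):
    """True iff every info-quorum shares a process with every accepting-quorum."""
    return all(q1 & q2 for q1 in info_quorums for q2 in accepting_quorums)
```

Majorities are one choice that satisfies the property. Another is a grid: with four
processes arranged in a 2 x 2 grid, taking rows as info-quorums and columns as
accepting-quorums also passes the check, although no row or column is a majority.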



Figure 6-4: Choosing the values of rounds (the figure's columns are labeled "Ballot
number" and "Value"). Empty boxes denote that the process
is in the info-quorum, and black boxes denote acceptance. Dotted lines indicate
commitments.

Example. Figure 6-4 shows how the value of a round is chosen. In this example we
have a network of 5 processes, A, B, C, D, E (where the ordering is the alphabetical one)
and v_A, v_B denote the initial values of A and B. At some point process B is the leader
and starts round (1, B). It receives information from A, B, E (the set {A, B, E} is the
info-quorum of this round). Since none of them has been involved in a previous round,
process B is free to choose its initial value v_B as the value of the round. However, it
receives acceptance only from B, C (the set {B, C} is the accepting-quorum for this round).
Later, process A becomes the leader and starts round (2, A). The info-quorum for this
round is {A, D, E}. Since none of these processes has accepted a value in a previous
round, A is free to choose its initial value for its round. For round (2, D) the
info-quorum is {C, D, E}. This time the quorum contains process C, which has accepted a
value in round (1, B), so the value of this round must be the same as that of round
(1, B). For round (3, A) the info-quorum is {A, B, E} and, since A has accepted the value
of round (2, A), the value of round (2, A) is chosen for round (3, A). For round (3, B)
the info-quorum is {A, C, D}. In this case there are three processes that accepted values
in previous rounds: process A, which has accepted the value of round (2, A), and
processes C and D, which have accepted the value of round (2, D). Since round (2, D) has
the highest round number, the value for round (3, B) is taken from round (2, D).
Round (3, B) is successful.

To end up with a decision value, rounds must be started until at least one is 
successful. The basic consensus module BASICPAXOS guarantees that a new round 
does not violate agreement or validity, that is, the value of a new round is chosen in 
such a way that if the round is successful, it does not violate agreement and validity. 
However, it is necessary to make BASICPAXOS start rounds until one is successful. We 
deal with this problem in Section 6.3. 

6.2.2 The code 

In order to describe automaton BASICPAXOS_i for process i we provide three automata.
One is called BPLEADER_i and models the "leader" behavior of the process; another
one is called BPAGENT_i and models the "agent" behavior of the process; the third one
is called BPSUCCESS_i and it simply takes care of broadcasting a reached decision.
Automaton BASICPAXOS_i is the composition of BPLEADER_i, BPAGENT_i and BPSUCCESS_i.

Figures 6-5 and 6-6 show the code for BPLEADER_i, while Figure 6-7 shows the
code for BPAGENT_i. We remark that these code fragments are written using the
MMTA model. Remember that we use MMTA to describe Clock GT automata in a simpler
way. In Section 2.3 we have described a standard technique to transform
any MMTA into a Clock GTA. Figures 6-8 and 6-9 show automaton BPSUCCESS_i.
The purpose of this automaton is simply to broadcast the decision once it has been
reached by the leader of a round. The interactions among these automata are shown
in Figure 6-2; Figure 6-3 describes the sequence of messages used in a round.

It is worth noticing that the code fragments are "tuned" to work efficiently when
there are no failures. Indeed, messages for a given round are sent only once, that is, no
attempt is made to cope with losses of messages, and responses are expected to
be received within given time bounds. Other strategies to try to conduct a successful
round even in the presence of some failures could be used. For example, messages
could be sent more than once to cope with the loss of some messages, or a leader could
wait more than the minimum required time before abandoning the current round and
starting a new one (this is actually dealt with in Section 6.3). We have chosen to send
only one message for each step of the round: if the execution is nice, one message
is enough to conduct a successful round. Once a decision has been made, there is
nothing to do but try to send it to others. Thus once the decision has been made
by the leader, the leader repeatedly sends the decision to the agents until it gets an
acknowledgment. We remark that also in this case, in practice, it is important to
choose appropriate time-outs for the re-sending of a message; in our implementation
we have chosen to wait the minimum amount of time required by an agent to respond
to a message from the leader; if the execution is stable this is enough to ensure that
only one message announcing the decision is sent to each agent.

We remark that there is some redundancy that derives from having separate automata
for the leader behavior and for the broadcasting of the decision. For example,
both automata BPLEADER_i and BPSUCCESS_i need to be aware of the decision,
thus both have a Decision variable (the Decision variable of BPSUCCESS_i is updated
when action RndSuccess_i is executed by BPLEADER_i after the Decision variable of

BPLEADER_i

Signature:
    Input:    Receive(m)_{j,i}, m ∈ {"Last", "Accept", "Success", "OldRound"}
              Init(v)_i, NewRound_i, Stop_i, Recover_i, Leader_i, NotLeader_i
    Internal: Collect_i, GatherLast_i, Continue_i,
              GatherAccept_i, GatherOldRound_i
    Output:   Send(m)_{i,j}, m ∈ {"Collect", "Begin"}
              BeginCast_i, RndSuccess(v)_i

States:
    Status ∈ {alive, stopped}                initially alive
    IamLeader, a boolean                     initially false
    Mode ∈ {collect, gatherlast, wait, begincast,
            gatheraccept, decided, rnddone}  initially rnddone
    InitValue ∈ V ∪ {nil}                    initially nil
    Decision ∈ V ∪ {nil}                     initially nil
    CurRnd ∈ ℛ                               initially (0, i)
    HighestRnd ∈ ℛ                           initially (0, i)
    RndValue ∈ V ∪ {nil}                     initially nil
    RndVFrom ∈ ℛ                             initially (0, i)
    RndInfQuo ∈ 2^ℐ                          initially {}
    RndAccQuo ∈ 2^ℐ                          initially {}
    InMsgs, multiset of messages             initially {}
    OutMsgs, multiset of messages            initially {}

Tasks and bounds:
    {Collect_i, GatherLast_i, Continue_i, BeginCast_i, GatherAccept_i, RndSuccess(v)_i}, bounds [0, ℓ]
    {GatherOldRound_i}, bounds [0, ℓ]
    {Send(m)_{i,j} : m ∈ M}, bounds [0, ℓ]

Actions:
    input Stop_i
        Eff: Status := stopped

    input Recover_i
        Eff: Status := alive

    input Leader_i
        Eff: if Status = alive then
                 IamLeader := true

    input NotLeader_i
        Eff: if Status = alive then
                 IamLeader := false

    output Send(m)_{i,j}
        Pre: Status = alive
             m_{i,j} ∈ OutMsgs
        Eff: remove m_{i,j} from OutMsgs

    input Receive(m)_{j,i}
        Eff: if Status = alive then
                 add m_{j,i} to InMsgs

Figure 6-5: Automaton BPLEADER for process i (part 1)

Actions:
    input Init(v)_i
        Eff: if Status = alive then
                 InitValue := v

    input NewRound_i
        Eff: if Status = alive then
                 CurRnd := HighestRnd +_i 1
                 HighestRnd := CurRnd
                 Mode := collect

    internal Collect_i
        Pre: Status = alive
             Mode = collect
        Eff: RndVFrom := (0, i)
             RndInfQuo := {}; RndAccQuo := {}
             ∀j put (CurRnd, "Collect")_{i,j} in OutMsgs
             Mode := gatherlast

    internal GatherLast_i
        Pre: Status = alive
             Mode = gatherlast
             m = (r, "Last", r', v)_{j,i} ∈ InMsgs
             CurRnd = r
        Eff: remove all copies of m from InMsgs
             RndInfQuo := RndInfQuo ∪ {j}
             if RndVFrom < r' and v ≠ nil then
                 RndValue := v
                 RndVFrom := r'
             if |RndInfQuo| > n/2 then
                 if RndValue = nil and InitValue ≠ nil then
                     RndValue := InitValue
                 if RndValue ≠ nil then
                     Mode := begincast
                 else
                     Mode := wait

    internal Continue_i
        Pre: Status = alive
             Mode = wait
             InitValue ≠ nil
        Eff: if RndValue = nil then
                 RndValue := InitValue
             Mode := begincast

    output BeginCast_i
        Pre: Status = alive
             Mode = begincast
        Eff: ∀j put (CurRnd, "Begin", RndValue)_{i,j} in OutMsgs
             Mode := gatheraccept

    internal GatherAccept_i
        Pre: Status = alive
             Mode = gatheraccept
             m = (r, "Accept")_{j,i} ∈ InMsgs
             CurRnd = r
        Eff: remove all copies of m from InMsgs
             RndAccQuo := RndAccQuo ∪ {j}
             if |RndAccQuo| > n/2 then
                 Decision := RndValue
                 Mode := decided

    output RndSuccess(Decision)_i
        Pre: Status = alive
             Mode = decided
        Eff: Mode := rnddone

    internal GatherOldRound_i
        Pre: Status = alive
             m = (r, "OldRound", r')_{j,i} ∈ InMsgs
             CurRnd < r
        Eff: remove m from InMsgs
             HighestRnd := r'

Figure 6-6: Automaton BPLEADER for process i (part 2)

BPAGENT_i

Signature:
    Input:    Receive(m)_{j,i}, m ∈ {"Collect", "Begin"}
              Init(v)_i, Stop_i, Recover_i
    Internal: LastAccept_i, Accept_i
    Output:   Send(m)_{i,j}, m ∈ {"Last", "Accept", "OldRound"}

States:
    Status ∈ {alive, stopped}        initially alive
    LastR ∈ ℛ                        initially (0, i)
    LastV ∈ V ∪ {nil}                initially nil
    Commit ∈ ℛ                       initially (0, i)
    InMsgs, multiset of messages     initially {}
    OutMsgs, multiset of messages    initially {}

Tasks and bounds:
    {LastAccept_i}, bounds [0, ℓ]
    {Accept_i}, bounds [0, ℓ]
    {Send(m)_{i,j} : m ∈ M}, bounds [0, ℓ]

Actions:
    input Stop_i
        Eff: Status := stopped

    input Recover_i
        Eff: Status := alive

    output Send(m)_{i,j}
        Pre: Status = alive
             m_{i,j} ∈ OutMsgs
        Eff: remove m_{i,j} from OutMsgs

    input Receive(m)_{j,i}
        Eff: if Status = alive then
                 add m_{j,i} to InMsgs

    internal LastAccept_i
        Pre: Status = alive
             m = (r, "Collect")_{j,i} ∈ InMsgs
        Eff: remove all copies of m from InMsgs
             if r ≥ Commit then
                 Commit := r
                 put (r, "Last", LastR, LastV)_{i,j} in OutMsgs
             else
                 put (r, "OldRound", Commit)_{i,j} in OutMsgs

    internal Accept_i
        Pre: Status = alive
             m = (r, "Begin", v)_{j,i} ∈ InMsgs
        Eff: remove all copies of m from InMsgs
             if r ≥ Commit then
                 put (r, "Accept")_{i,j} in OutMsgs
                 LastR := r, LastV := v
             else
                 put (r, "OldRound", Commit)_{i,j} in OutMsgs

    input Init(v)_i
        Eff: if Status = alive then
                 LastV := v

Figure 6-7: Automaton BPAGENT for process i

BPSUCCESS_i

Signature:
    Input:        Receive(m)_{j,i}, m ∈ {"Ack", "Success"}
                  Stop_i, Recover_i, Leader_i, NotLeader_i, RndSuccess(v)_i
    Internal:     SendSuccess_i, GatherSuccess_i, GatherAck_i, Wait_i
    Output:       Decide(v)_i, Send("Success", v)_{i,j}
    Time-passage: ν(t)

State:
    Clock ∈ ℝ                        initially arbitrary
    Status ∈ {alive, stopped}        initially alive
    Decision ∈ V ∪ {nil}             initially nil
    IamLeader, a boolean             initially false
    Acked(j), a boolean ∀j ∈ ℐ       initially all false
    PrevSend ∈ ℝ ∪ {nil}             initially nil
    LastSend ∈ ℝ ∪ {∞}               initially ∞
    LastWait ∈ ℝ ∪ {∞}               initially ∞
    LastGA ∈ ℝ ∪ {∞}                 initially ∞
    LastGS ∈ ℝ ∪ {∞}                 initially ∞
    LastSS ∈ ℝ ∪ {∞}                 initially ∞
    InMsgs, multiset of messages     initially {}
    OutMsgs, multiset of messages    initially {}

Actions:
    input Stop_i
        Eff: Status := stopped

    input Recover_i
        Eff: Status := alive

    input Leader_i
        Eff: if Status = alive then
                 IamLeader := true

    input NotLeader_i
        Eff: if Status = alive then
                 IamLeader := false

    output Send(m)_{i,j}
        Pre: Status = alive
             m_{i,j} ∈ OutMsgs
        Eff: remove m_{i,j} from OutMsgs
             if OutMsgs is empty
                 LastSend := ∞
             else
                 LastSend := Clock + ℓ

    input Receive(m)_{j,i}
        Eff: if Status = alive then
                 put m_{j,i} into InMsgs
                 if m is an "Ack" message and LastGA = ∞ then
                     LastGA := Clock + ℓ
                 if m is a "Success" message and LastGS = ∞ then
                     LastGS := Clock + ℓ

    input RndSuccess(v)_i
        Eff: if Status = alive then
                 Decision := v
                 LastSS := Clock + ℓ

Figure 6-8: Automaton BPSUCCESS for process i (part 1)

    internal SendSuccess_i
        Pre: Status = alive, IamLeader = true
             Decision ≠ nil, PrevSend = nil
             ∃j ≠ i s.t. Acked(j) = false
        Eff: ∀j ≠ i s.t. Acked(j) = false
                 put ("Success", Decision)_{i,j} in OutMsgs
             PrevSend := Clock
             LastSend := Clock + ℓ
             LastWait := Clock + (4ℓ + 2nℓ + 2d) + ℓ
             LastSS := ∞

    internal Wait_i
        Pre: Status = alive, PrevSend ≠ nil
             Clock > PrevSend + (4ℓ + 2nℓ + 2d)
        Eff: PrevSend := nil
             LastWait := ∞

    internal GatherAck_i
        Pre: Status = alive
             m = ("Ack")_{j,i} ∈ InMsgs
        Eff: remove all copies of m from InMsgs
             Acked(j) := true
             if no other "Ack" is in InMsgs then
                 LastGA := ∞
             else
                 LastGA := Clock + ℓ

    internal GatherSuccess_i
        Pre: Status = alive
             m = ("Success", v)_{j,i} ∈ InMsgs
        Eff: remove all copies of m from InMsgs
             Decision := v
             put ("Ack")_{i,j} in OutMsgs

    output Decide(v)_i
        Pre: Status = alive
             Decision ≠ nil
             Decision = v
        Eff: none

    time-passage ν(t)
        Pre: Status = alive
        Eff: Let t' be s.t.
                 Clock + t' < LastSend
                 Clock + t' < LastWait
                 Clock + t' < LastSS
                 Clock + t' < LastGS
                 Clock + t' < LastGA
             Clock := Clock + t'

Figure 6-9: Automaton BPSUCCESS for process i (part 2)

BPLEADER_i is set). Having only one automaton would have eliminated the need for
such a duplication. However, we preferred to separate BPLEADER_i and BPSUCCESS_i
because they accomplish different tasks.

In addition to the code fragments of BPLEADER_i, BPAGENT_i and BPSUCCESS_i,
we provide here some comments about the messages, the state variables and the
actions.

Messages. In this paragraph we describe the messages used for communication
between the leader i and the agents of a round. Every message m is a tuple of
elements. The messages are:

1. "Collect" messages, m = (r, "Collect")_{i,j}. This message is sent by the leader of
a round to announce that a new round, with number r, has been started and
at the same time to ask for information about previous rounds.

2. "Last" messages, m = (r, "Last", r', v)_{j,i}. This message is sent by an agent to
respond to a "Collect" message from the leader. It provides the last round r' in
which the agent has accepted a value, and the value v proposed in that round.
If the agent did not accept any value in previous rounds, then v is either nil
or the initial value of the agent and r' is (0, j).

3. "Begin" messages, m = (r, "Begin", v)_{i,j}. This message is sent by the leader of
round r to announce the value v of the round and at the same time to ask to
accept it.

4. "Accept" messages, m = (r, "Accept")_{j,i}. This message is sent by an agent to
respond to a "Begin" message from the leader. With this message an agent
accepts the value proposed in the current round.

5. "OldRound" messages, m = (r, "OldRound", r')_{j,i}. This message is sent by an
agent to respond either to a "Collect" or a "Begin" message. It is sent when the
agent is committed to reject the round specified in the received message, and has
the goal of informing the leader about round r', which is the higher-numbered
round for which the agent is committed to reject round r.

6. "Success" messages, m = ("Success", v)_{i,j}. This message is sent by the leader
after a successful round.

7. "Ack" messages, m = ("Ack")_{j,i}. This message is an acknowledgment, so that
the leader can be sure that an agent has received the "Success" message.
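
For readers who prefer typed records, the seven message formats above can be written
down as Python dataclasses. This is only one possible encoding: the class names, the
field name r_prime for r', and the grouping constants are assumptions of this sketch,
and the sender and receiver subscripts are left out because a channel would carry them.

```python
from dataclasses import dataclass
from typing import Any, Tuple

Round = Tuple[int, Any]   # (counter, process identifier)


@dataclass(frozen=True)
class Collect:            # (r, "Collect"): leader announces round r
    r: Round


@dataclass(frozen=True)
class Last:               # (r, "Last", r', v): agent reports its last accepted round
    r: Round
    r_prime: Round
    v: Any


@dataclass(frozen=True)
class Begin:              # (r, "Begin", v): leader proposes value v for round r
    r: Round
    v: Any


@dataclass(frozen=True)
class Accept:             # (r, "Accept"): agent accepts the value of round r
    r: Round


@dataclass(frozen=True)
class OldRound:           # (r, "OldRound", r'): agent rejects r, committed to r'
    r: Round
    r_prime: Round


@dataclass(frozen=True)
class Success:            # ("Success", v): leader broadcasts the decision
    v: Any


@dataclass(frozen=True)
class Ack:                # ("Ack"): agent acknowledges a "Success" message
    pass


FROM_LEADER = (Collect, Begin, Success)
FROM_AGENT = (Last, Accept, OldRound, Ack)
```

Making the records frozen keeps them immutable and comparable by value, which is
convenient when messages are stored in a multiset such as InMsgs or OutMsgs.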

Automaton BPLEADER_i. Variable IamLeader keeps track of whether the process is
leader; it is updated by actions Leader_i and NotLeader_i. Variable Mode is used by the
leader to go through the steps of a round. It is used like a program counter. Variable
InitValue contains the initial value of the process. This value is set by some external
agent by means of the Init(v)_i action and it is initially undefined. Variable Decision
contains the value decided by process i. Variable CurRnd contains the number of
the round for which process i is currently the leader. Variable HighestRnd stores the
highest round number seen by process i. Variable RndValue contains the value being
proposed in the current round. Variable RndVFrom is the round number of the round
from which RndValue has been chosen (recall that a leader sets the value for its round
to be equal to the value of a particular previous round, which is round RndVFrom).
Variable RndInfQuo contains the set of processes from which a "Last" message has
been received by process i (that is, the info-quorum). Variable RndAccQuo contains
the set of processes from which an "Accept" message has been received by process i
(that is, the accepting-quorum). We remark that in the original paper by Lamport,
there is only one quorum which is fixed in the first exchange of messages between the
leader and the agents, so that only processes in that quorum can accept the value
being proposed. However, there is no need to restrict the set of processes that can
accept the proposed value to the info-quorum of the round. Messages from processes
in the info-quorum are used only to choose a consistent value for the round, and once
this has been done anyone can accept that value. This improvement is also suggested
in Lamport's paper.

Actions Leader_i and NotLeader_i are used to update IamLeader. Action Init(v)_i is
used by an external agent to set the initial value of a process. Action RndSuccess_i is
used to output the decision. Action NewRound_i starts a new round. It sets the new
round number by increasing the highest round number ever seen. Then action Collect_i
resets to the initial values all the variables that describe the status of the round and
broadcasts a "Collect" message. Action GatherLast_i collects the information sent
by agents in response to the leader's "Collect" message. This information is the
number of the last round accepted by the agent and the value of that round. Upon
receiving these messages, GatherLast_i updates, if necessary, variables RndValue and
RndVFrom. It also updates the info-quorum of the current round by adding to it the
agent who sent the information. GatherLast_i is executed until a majority of the processes
have sent their own information. When "Last" messages have been collected from a
majority of the processes, GatherLast_i is no longer enabled. If RndValue is defined
then action BeginCast_i is enabled. If RndValue is not defined (and this is possible
if the leader does not have an initial value and does not receive any value in "Last"
messages) the leader waits for an initial value before enabling action BeginCast_i.
When an initial value is provided, action Continue_i sets RndValue and enables action
BeginCast_i. Action BeginCast_i broadcasts a "Begin" message with the value chosen
for the round. Action GatherAccept_i gathers the "Accept" messages. If a majority of
the processes accept the value of the current round then the round is successful and
GatherAccept_i sets the Decision variable to the value of the current round. When
variable Decision has been set, action RndSuccess_i is enabled and it outputs the
decision made. Action GatherOldRound_i collects messages that inform process i that
the round previously started by i is "old", in the sense that a round with a higher
number has been started. Process i can update, if necessary, its HighestRnd variable.
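
The way the leader gathers "Last" messages can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the code of BPLEADER_i: the class LeaderRound, the starting value (0, 0) for RndVFrom, the mode name "wait" and the exact update rule (adopt the value reported for the highest-numbered round seen so far) are assumptions that fill in the phrase "updates, if necessary".

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

Round = Tuple[int, int]  # assumption: (counter, process index), compared lexicographically


@dataclass
class LeaderRound:
    n: int                                   # number of processes
    cur_rnd: Round                           # CurRnd
    rnd_value: Optional[str] = None          # RndValue (None stands for nil)
    rnd_v_from: Round = (0, 0)               # RndVFrom; (0, 0) is an assumed start
    rnd_inf_quo: Set[int] = field(default_factory=set)  # RndInfQuo
    mode: str = "gatherlast"                 # Mode, used like a program counter


def gather_last(st: LeaderRound, sender: int, r: Round,
                last_round: Round, value: Optional[str]) -> None:
    """Process one message (r, "Last", last_round, value) sent by `sender`."""
    if st.mode != "gatherlast" or r != st.cur_rnd:
        return  # action not enabled, or message for another round
    # Assumed update rule: keep the value of the highest-numbered round reported.
    if value is not None and (st.rnd_value is None or last_round > st.rnd_v_from):
        st.rnd_value, st.rnd_v_from = value, last_round
    st.rnd_inf_quo.add(sender)
    if 2 * len(st.rnd_inf_quo) > st.n:       # the info-quorum is now a majority
        # Without a value the leader has to wait for an initial value (Continue).
        st.mode = "begincast" if st.rnd_value is not None else "wait"
```

For example, with n = 3 the second "Last" message completes the info-quorum, and any later "Last" message for the same round is ignored because the action is no longer enabled.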

Automaton BPAGENT_i. Variable LastR is the round number of the latest round for
which process i has sent an "Accept" message. Variable LastV is the value for round
LastR. Variable Commit specifies the round for which process i is committed and
thus specifies the set of rounds that process i must reject, which are all the rounds
with round number less than Commit. We remark that when an agent commits for
a round r and sends to the leader of round r a "Last" message specifying the latest
round r' < r in which it has accepted the proposed value, it is enough that the agent
commits to not accept the value of any round r'' in between r' and r. To make the
code simpler, when an agent commits for a round r, it commits to reject any round
r'' < r.

Action LastAccept_i responds to the "Collect" message sent by the leader by sending
a "Last" message that gives information about the last round in which the agent
has been involved. Action Accept_i responds to the "Begin" message sent by the
leader. The agent accepts the value of the current round if it is not rejecting the
round. In both the LastAccept_i and Accept_i actions, if the agent is committed to reject
the current round because of a higher-numbered round, then a notification is sent
to the leader so that the leader can update the highest round number ever seen.
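
The agent's two responses can be sketched as follows. This is a simplified Python illustration, not the code of BPAGENT_i: the Agent class, the function names and the tuple encoding of messages are assumptions. The test r >= commit reflects the statement above that the rejected rounds are those with round number less than Commit.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Round = Tuple[int, int]  # assumption: (counter, process index), compared lexicographically


@dataclass
class Agent:
    last_r: Round                  # LastR: latest round for which "Accept" was sent
    commit: Round                  # Commit: rounds below this one are rejected
    last_v: Optional[str] = None   # LastV: value of round LastR (None stands for nil)


def last_accept(a: Agent, r: Round) -> tuple:
    """Respond to a "Collect" message for round r."""
    if r >= a.commit:
        a.commit = r                              # commit to reject every round below r
        return (r, "Last", a.last_r, a.last_v)
    return (r, "OldRound", a.commit)              # tell the leader about the higher round


def accept(a: Agent, r: Round, v: str) -> tuple:
    """Respond to a "Begin" message for round r carrying value v."""
    if r >= a.commit:
        a.last_r, a.last_v = r, v                 # remember the accepted round and value
        return (r, "Accept")
    return (r, "OldRound", a.commit)
```

An agent that has committed to round (2, 0) therefore answers a "Begin" message for round (1, 2) with an "OldRound" message naming (2, 0).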

Automaton BPSUCCESS_i. Variable Decision contains a copy of the variable Decision
of BPLEADER_i; indeed it is updated when the output action RndSuccess_i of BPLEADER_i
is executed. Variable IamLeader has the same function as in BPLEADER_i. Variable
Acked(j) contains a boolean that specifies whether or not process j has sent an
acknowledgment for a "Success" message. Variable PrevSend records the time of the
previous broadcast of the decision. Variables LastSend, LastWait, LastGA, LastGS,
LastSS are used to impose the time bounds on the actions. Their use should be clear
from the code.

Action RndSuccess_i simply takes care of updating the Decision variable and sets
a time bound for the execution of action SendSuccess_i. Action SendSuccess_i sends
the "Success" message, along with the value of Decision, to all processes for which
there is no acknowledgment. Then it sets the time bounds for the re-sending of the
"Success" message (and also the time bound for the actual sending of the messages,
since outgoing messages are handled with the use of OutMsgs). Action Wait_i re-enables
action SendSuccess_i after an appropriate time bound. We remark that 3ℓ + 2nℓ + 2d is
the total time needed to send the "Success" message and get back an "Ack" message
(see Lemma 6.2.21). Action GatherSuccess_i handles the receipt of "Success" messages
from processes that already know the decision and sends an acknowledgment. Action
GatherAck_i handles the "Ack" messages.

We remark that automaton BPSUCCESS_i needs to be able to measure the passage
of time; indeed it is a Clock GTA.

6.2.3 Partial Correctness 

Let us define the system S_BPX to be the composition of system S_CHA and automaton
BASICPAXOS_i for each process i ∈ I (remember that BASICPAXOS_i is the composition
of automata BPLEADER_i, BPAGENT_i and BPSUCCESS_i). In this section we prove the
partial correctness of S_BPX: we show that in any execution of the system S_BPX,
agreement and validity are guaranteed.

For these proofs, we augment the algorithm with a collection ℋ of history variables.
Each variable in ℋ is an array indexed by the round number. For every round number
r a history variable contains some information about round r. In particular the set
ℋ consists of:

Hleader(r) ∈ I ∪ {nil}, initially nil (the leader of round r).

Hvalue(r) ∈ V ∪ {nil}, initially nil (the value for round r).

Hfrom(r) ∈ R ∪ {nil}, initially nil (the round from which Hvalue(r) is taken).

Hinfquo(r), subset of I, initially {} (the info-quorum of round r).

Haccquo(r), subset of I, initially {} (the accepting-quorum of round r).

Hreject(r), subset of I, initially {} (processes committed to reject round r).

The code fragments of automata BPLEADER_i and BPAGENT_i augmented with the
history variables are shown in Figure 6-10. The figure shows only the actions that
change history variables. Actions of BPSUCCESS_i do not change history variables.

BPLEADER_i Actions:

  input NewRound_i
    Eff: if Status = alive then
           CurRnd := HighestRnd + 1
           • Hleader(CurRnd) := i
           HighestRnd := CurRnd
           Mode := collect

  output BeginCast_i
    Pre: Status = alive
         Mode = begincast
    Eff: ∀j put (CurRnd, "Begin", RndValue)_{i,j} in OutMsgs
         • Hinfquo(CurRnd) := RndInfQuo
         • Hfrom(CurRnd) := RndVFrom
         • Hvalue(CurRnd) := RndValue
         Mode := gatheraccept

  internal GatherAccept_i
    Pre: Status = alive
         Mode = gatheraccept
         m = (r, "Accept")_{j,i} ∈ InMsgs
         CurRnd = r
    Eff: remove all copies of m from InMsgs
         RndAccQuo := RndAccQuo ∪ {j}
         if |RndAccQuo| > n/2 then
           Decision := RndValue
           • Haccquo(CurRnd) := RndAccQuo
           Mode := decide

BPAGENT_i Actions:

  internal LastAccept_i
    Pre: Status = alive
         m = (r, "Collect")_{j,i} ∈ InMsgs
    Eff: remove all copies of m from InMsgs
         if r ≥ Commit then
           Commit := r
           • For all r', LastR < r' < r
           •   Hreject(r') := Hreject(r') ∪ {i}
           put (r, "Last", LastR, LastV)_{i,j} in OutMsgs
         else
           put (r, "OldRound", Commit)_{i,j} in OutMsgs

Figure 6-10: Actions of BPLEADER_i and BPAGENT_i for process i augmented with
history variables (statements marked with • are the ones that update history
variables). Only the actions that do change history variables are shown. Other
actions are the same as in BPLEADER_i and BPAGENT_i, i.e. they do not change history
variables. Actions of BPSUCCESS_i do not change history variables.

Initially, when no round has been started yet, all the information contained in the
history variables is set to the initial values. All history variables of round r except
Hreject(r) are set by the leader of round r, thus if the round has not been started these
variables remain at their initial values. More formally we have the following lemma.

Lemma 6.2.2 In any state of an execution of S_BPX, if Hleader(r) = nil then

Hvalue(r) = nil

Hfrom(r) = nil

Hinfquo(r) = {}

Haccquo(r) = {}.

Proof: By an easy induction. ∎

Given a round r, Hreject(r) is modified by all the processes that commit themselves
to reject round r, and we know nothing about its value at the time round r is started.
Next we define some key concepts that will be instrumental in the proofs.

Definition 6.2.3 In any state of the system S_BPX, a round r is said to be "dead" if
|Hreject(r)| ≥ n/2.

That is, a round r is dead if at least n/2 of the processes are rejecting it. Hence, if a
round r is dead, there cannot be a majority of the processes accepting its value, i.e.,
round r cannot be successful.

Definition 6.2.4 The set R_S is the set {r ∈ R | Hleader(r) ≠ nil}.

That is, R_S is the set of rounds that have been started. A round r is formally started
as soon as its leader Hleader(r) is defined by the NewRound_i action.

Definition 6.2.5 The set R_V is the set {r ∈ R | Hvalue(r) ≠ nil}.

That is, R_V is the set of rounds for which the value has been chosen.

Invariant 6.2.6 In any state s of an execution of S_BPX, we have that R_V ⊆ R_S.

Indeed for any round r, if Hleader(r) is nil, by Lemma 6.2.2 we have that Hvalue(r)
is also nil. Hence Hvalue(r) is always set after Hleader(r) has been set.

Next we formally define the concept of anchored round, which is crucial to the
proofs. Informally a round r is anchored if its value is consistent with the value
chosen in any previous round r'. Consistent means that either the value of round r
is equal to the value of round r' or round r' is dead. Intuitively, it is clear that if all
the rounds are either anchored or dead, then agreement is satisfied.

Definition 6.2.7 A round r ∈ R_V is said to be "anchored" if for every round r' ∈ R_V
such that r' < r, either round r' is dead or Hvalue(r') = Hvalue(r).
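
The notions of dead and anchored rounds can be made concrete with a small checker over the history variables. This is a hedged Python sketch, not part of the formal development: the dictionaries hvalue and hreject stand in for the arrays Hvalue and Hreject, None stands for nil, and the function names are assumptions. The dead test follows the "at least n/2" reading given after Definition 6.2.3.

```python
from typing import Dict, Optional, Set, Tuple

Round = Tuple[int, int]  # assumption: (counter, process index), compared lexicographically


def is_dead(r: Round, hreject: Dict[Round, Set[int]], n: int) -> bool:
    """A round is dead when at least n/2 processes are committed to reject it."""
    return 2 * len(hreject.get(r, set())) >= n


def is_anchored(r: Round, hvalue: Dict[Round, Optional[str]],
                hreject: Dict[Round, Set[int]], n: int) -> bool:
    """Check the anchored condition for a round whose value has been chosen."""
    if hvalue.get(r) is None:
        raise ValueError("anchored is only defined for rounds with a chosen value")
    return all(
        is_dead(rp, hreject, n) or v == hvalue[r]
        for rp, v in hvalue.items()
        if v is not None and rp < r      # earlier rounds with a chosen value
    )
```

With n = 5, values {(1, 0): "a", (2, 1): "b", (3, 0): "a"} and three processes rejecting round (2, 1), round (3, 0) is anchored: the only earlier round with a different value is dead.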

Next we prove that S_BPX guarantees agreement, by using a sequence of invariants.
The key invariant is Invariant 6.2.13, which states that all rounds are either dead or
anchored. The first invariant captures the fact that when a process sends a "Last"
message in response to a "Collect" message for a round r, then it commits to not vote
for rounds previous to round r.

Invariant 6.2.8 In any state s of an execution of S_BPX, if message (r, "Last", r'', v)_{j,i}
is in OutMsgs_j, then j ∈ Hreject(r') for all r' such that r'' < r' < r.

Proof: We prove the invariant by induction on the length k of the execution α. The
base is trivial: if k = 0 then α = s_0, and in the initial state no messages are in
OutMsgs_j. Hence the invariant is vacuously true. For the inductive step assume that
the invariant is true for α = s_0 π_1 s_1 ... π_k s_k and consider the execution
s_0 π_1 s_1 ... π_k s_k π s. We need to prove that the invariant is still true in s. We
distinguish two cases.

CASE 1. In state s_k, message (r, "Last", r'', v)_{j,i} is in OutMsgs_j. In this case, by
the inductive hypothesis, in state s_k we have that j ∈ Hreject(r') for all r' such
that r'' < r' < r. Since no process is ever removed from any Hreject set, also
in state s we have that j ∈ Hreject(r') for all r' such that r'' < r' < r.

CASE 2. In state s_k, message (r, "Last", r'', v)_{j,i} is not in OutMsgs_j. Since message
(r, "Last", r'', v)_{j,i} is in OutMsgs_j in state s, it must be that π = LastAccept_j and that
s_k.LastR = r''. Then the invariant follows by the code of LastAccept_j, which puts
process j into Hreject(r') for all r' such that r'' < r' < r. ∎

The next invariant states that the commitment made by an agent when sending
a "Last" message is still in effect when the message is in the communication channel.
This should be obvious, but to be precise in the rest of the proof we prove it formally.

Invariant 6.2.9 In any state s of an execution of S_BPX, if message (r, "Last", r'', v)_{j,i}
is in CHANNEL_{j,i}, then j ∈ Hreject(r') for all r' such that r'' < r' < r.

Proof: We prove the invariant by induction on the length k of the execution α.
The base is trivial: if k = 0 then α = s_0, and in the initial state no messages are in
CHANNEL_{j,i}. Hence the invariant is vacuously true. For the inductive step assume that
the invariant is true for α = s_0 π_1 s_1 ... π_k s_k and consider the execution
s_0 π_1 s_1 ... π_k s_k π s. We need to prove that the invariant is still true in s. We
distinguish two cases.

CASE 1. In state s_k, message (r, "Last", r'', v)_{j,i} is in CHANNEL_{j,i}. In this case,
by the inductive hypothesis, in state s_k we have that j ∈ Hreject(r') for all r' such
that r'' < r' < r. Since no process is ever removed from any Hreject set, also
in state s we have that j ∈ Hreject(r') for all r' such that r'' < r' < r.

CASE 2. In state s_k, message (r, "Last", r'', v)_{j,i} is not in CHANNEL_{j,i}. Since
message (r, "Last", r'', v)_{j,i} is in CHANNEL_{j,i} in state s, it must be that π = Send(m)_{j,i}
with m = (r, "Last", r'', v)_{j,i}. By the precondition of action Send(m)_{j,i} we have that
message (r, "Last", r'', v)_{j,i} is in OutMsgs_j in state s_k. By Invariant 6.2.8 we have that
in state s_k process j ∈ Hreject(r') for all r' such that r'' < r' < r. Since no process is
ever removed from any Hreject set, also in state s we have that j ∈ Hreject(r')
for all r' such that r'' < r' < r. ∎

The next invariant states that the commitment made by an agent when sending 
a "Last" message is still in effect when the message is received by the leader. Again, 
this should be obvious. 

Invariant 6.2.10 In any state s of an execution of S_BPX, if message (r, "Last", r'', v)_{j,i}
is in InMsgs_i, then j ∈ Hreject(r') for all r' such that r'' < r' < r.

Proof: We prove the invariant by induction on the length k of the execution α. The
base is trivial: if k = 0 then α = s_0, and in the initial state no messages are in
InMsgs_i. Hence the invariant is vacuously true. For the inductive step assume that
the invariant is true for α = s_0 π_1 s_1 ... π_k s_k and consider the execution
s_0 π_1 s_1 ... π_k s_k π s. We need to prove that the invariant is still true in s. We
distinguish two cases.

CASE 1. In state s_k, message (r, "Last", r'', v)_{j,i} is in InMsgs_i. In this case, by the
inductive hypothesis, in state s_k we have that j ∈ Hreject(r') for all r' such that
r'' < r' < r. Since no process is ever removed from any Hreject set, also in
state s we have that j ∈ Hreject(r') for all r' such that r'' < r' < r.

CASE 2. In state s_k, message (r, "Last", r'', v)_{j,i} is not in InMsgs_i. Since message
(r, "Last", r'', v)_{j,i} is in InMsgs_i in state s, it must be that π = Receive(m)_{j,i} with
m = (r, "Last", r'', v)_{j,i}. By the effect of action Receive(m)_{j,i} we have that message
(r, "Last", r'', v)_{j,i} is in CHANNEL_{j,i} in state s_k. By Invariant 6.2.9 we have that in state
s_k process j ∈ Hreject(r') for all r' such that r'' < r' < r. Since no process is ever
removed from any Hreject set, also in state s we have that j ∈ Hreject(r') for
all r' such that r'' < r' < r. ∎

The following invariant states that the commitment of the agent is still in effect 
when the leader updates its information about previous rounds using the agents' 
"Last" messages. 

Invariant 6.2.11 In any state s of an execution of S_BPX, if process j ∈ RndInfQuo_i,
for some process i, and CurRnd_i = r, then for all r' such that s.RndVFrom_i < r' < r, we
have that j ∈ Hreject(r').

Proof: We prove the invariant by induction on the length k of the execution α. The
base is trivial: if k = 0 then α = s_0, and in the initial state no process j is in
RndInfQuo_i for any i. Hence the invariant is vacuously true. For the inductive step
assume that the invariant is true for α = s_0 π_1 s_1 ... π_k s_k and consider the execution
s_0 π_1 s_1 ... π_k s_k π s. We need to prove that the invariant is still true in s. We distinguish
two cases.

CASE 1. In state s_k, j ∈ RndInfQuo_i, for some process i, and CurRnd_i = r.
Then by the inductive hypothesis, in state s_k we have that j ∈ Hreject(r') for all
r' such that s_k.RndVFrom_i < r' < r. Since no process is ever removed from any
Hreject set and, as long as CurRnd_i is not changed, variable RndVFrom_i is never
decreased, also in state s we have that j ∈ Hreject(r') for all r' such that
s.RndVFrom_i < r' < r.

CASE 2. In state s_k, it is not true that j ∈ RndInfQuo_i, for some process i, and
CurRnd_i = r. Since in state s it holds that j ∈ RndInfQuo_i, for some process i, and
CurRnd_i = r, it must be the case that π = GatherLast_i and that m in the precondition
of GatherLast_i is m = (r, "Last", r'', v)_{j,i}. Notice that, by the precondition of
GatherLast_i, m ∈ InMsgs_i. Hence, by Invariant 6.2.10 we have that j ∈ Hreject(r')
for all r' such that r'' < r' < r. By the code of the GatherLast_i action we have that
RndVFrom_i ≥ r''. Whence the invariant is proved. ∎

The following invariant is basically the previous one stated when the leader has 
fixed the info-quorum. 

Invariant 6.2.12 In any state of an execution of S_BPX, if j ∈ Hinfquo(r) then for all r'
such that Hfrom(r) < r' < r, we have that j ∈ Hreject(r').

Proof: We prove the invariant by induction on the length k of the execution α.
The base is trivial: if k = 0 then α = s_0, and in the initial state we have that for
every round r, Hleader(r) = nil and thus by Lemma 6.2.2 there is no process j in
Hinfquo(r). Hence the invariant is vacuously true. For the inductive step assume that
the invariant is true for α = s_0 π_1 s_1 ... π_k s_k and consider the execution
s_0 π_1 s_1 ... π_k s_k π s. We need to prove that the invariant is still true in s. We
distinguish two cases.

CASE 1. In state s_k, j ∈ Hinfquo(r). By the inductive hypothesis, in state s_k we
have that j ∈ Hreject(r') for all r' such that Hfrom(r) < r' < r. Since no process is
ever removed from any Hreject set, also in state s we have that j ∈ Hreject(r')
for all r' such that Hfrom(r) < r' < r.

CASE 2. In state s_k, j ∉ Hinfquo(r). Since in state s, j ∈ Hinfquo(r), it must
be the case that action π puts j in Hinfquo(r). Thus it must be π = BeginCast_i for
some process i, and it must be s_k.CurRnd_i = r and j ∈ s_k.RndInfQuo_i. Since action
BeginCast_i does not change CurRnd_i and RndInfQuo_i we have that s.CurRnd_i = r
and j ∈ s.RndInfQuo_i. By Invariant 6.2.11 we have that j ∈ Hreject(r') for all r'
such that s.RndVFrom_i < r' < r. By the code of BeginCast_i we have that Hfrom(r) =
s.RndVFrom_i. ∎

We are now ready to prove the main invariant.

Invariant 6.2.13 In any state of an execution of S_BPX, any non-dead round r ∈ R_V
is anchored.

Proof: We proceed by induction on the length k of the execution α. The base is
trivial. When k = 0 we have that α = s_0 and in the initial state no round has been
started yet. Thus Hleader(r) = nil and by Lemma 6.2.2 we have that R_V = {} and
thus the assertion is vacuously true. For the inductive step assume that the assertion
is true for α = s_0 π_1 s_1 ... π_k s_k and consider the execution s_0 π_1 s_1 ... π_k s_k π s. We
need to prove that, for every possible action π, the assertion is still true in state s. First we
observe that the definition of "dead" round depends only upon the history variables
and that the definition of "anchored" round depends upon the history variables and
the definition of "dead" round. Thus the definition of "anchored" depends only on
the history variables. Thus actions that do not modify the history variables cannot
affect the truth of the assertion. The actions that change history variables are (see
code):

1. π = NewRound_i

2. π = BeginCast_i

3. π = GatherAccept_i

4. π = LastAccept_i

CASE 1. Assume π = NewRound_i. This action sets the history variable Hleader(r)
where r is the round number of the round being started by process i. The new round
r does not belong to R_V since Hvalue(r) is still undefined. Thus the assertion of the
lemma cannot be contradicted by this action.

CASE 2. Assume π = BeginCast_i. Action π sets Hvalue(r), Hfrom(r) and
Hinfquo(r) for some round r. Round r belongs to R_V in the new state s. In order
to prove that the assertion is still true it suffices to prove that round r is anchored in
state s and any round r', r' > r, is still anchored in state s (notice that rounds with
round number less than r are still anchored in state s, since the definition of anchored
for a given round involves only rounds with smaller round numbers).

First we prove that round r is anchored. From the precondition of BeginCast_i we have
that Hinfquo(r) contains more than n/2 processes; indeed variable Mode is equal to
begincast only if the cardinality of RndInfQuo is greater than n/2. Using Invariant
6.2.12 for each process j in Hinfquo(r), we have that for every round r' such that
Hfrom(r) < r' < r, there are more than n/2 processes in the set Hreject(r'), which
means that every such round r' is dead. Since Hvalue(Hfrom(r)) = Hvalue(r), round r is
anchored in state s.

Finally, we need to prove that any non-dead round r', r' > r, that was anchored
in s_k is still anchored in s. Since action BeginCast_i modifies only history variables
for round r, we only need to prove that in state s, Hvalue(r') = Hvalue(r). Let r''
be equal to Hfrom(r). Since r' is anchored in state s_k we have that s_k.Hvalue(r') =
s_k.Hvalue(r''). Again because BeginCast_i modifies only history variables for round
r, we have that s.Hvalue(r') = s.Hvalue(r''). But we have proved that round r is
anchored in state s and thus s.Hvalue(r) = s.Hvalue(r''). Hence s.Hvalue(r') =
s.Hvalue(r).

CASE 3. Assume π = GatherAccept_i. This action modifies only variable Haccquo,
which is not involved in the definition of anchored. Thus this action cannot make the
assertion false.

CASE 4. Assume π = LastAccept_i. This action modifies Hinfquo and Hreject.
Variable Hinfquo is not involved in the definition of anchored. Action LastAccept_i
may put process i in Hreject of some rounds and this, in turn, may make those
rounds dead. However this cannot make the assertion false; indeed if a round r was
anchored in s_k it is still anchored when another round becomes dead. ∎

The next invariant follows easily from the previous one and gives a more direct 
statement about the agreement property. 

Invariant 6.2.14 In any state of an execution of S_BPX, all the Decision variables
that are not nil are set to the same value.

Proof: We prove the invariant by induction on the length k of the execution α. The
base of the induction is trivially true: for k = 0 we have that α = s_0 and in the initial
state all the Decision variables are undefined.

Assume that the assertion is true for α = s_0 π_1 s_1 ... π_k s_k and consider the execution
s_0 π_1 s_1 ... π_k s_k π s. We need to prove that, for every possible action π, the assertion is
still true in state s. Clearly the only actions which can make the assertion false are
those that set Decision_i, for some process i. Thus we only need to consider actions
GatherAccept_i and GatherSuccess_i.

CASE 1. Assume π = GatherAccept_i. This action sets Decision_i to Hvalue(r)
where r is some round number. If all Decision_j, j ≠ i, are undefined then Decision_i
is the first decision and the assertion is still true. Assume there is only one Decision_j
already defined. Let Decision_j = Hvalue(r') for some round r'. By Invariant 6.2.13,
rounds r and r' are anchored and thus we have that Hvalue(r') = Hvalue(r). Whence
Decision_i = Decision_j. If there are several Decision_j, j ≠ i, which are already defined,
then by the inductive hypothesis they are all equal. Thus, the lemma follows.

CASE 2. Assume π = GatherSuccess_i. This action sets Decision_i to the value
specified in the "Success" message that enabled the action. It is easy to see (by
the code) that the value sent in a "Success" message is always the Decision of some
process. Thus we have that Decision_i is equal to Decision_j for some other process j
and by the inductive hypothesis if there is more than one Decision variable already
set they are all equal. ∎

Finally we can prove that agreement is satisfied.

Theorem 6.2.15 In any execution of the system S_BPX, agreement is satisfied.

Proof: The theorem follows easily by Invariant 6.2.14. ∎

Validity is easier to prove since the value proposed in any round comes either from
a value supplied by an Init(v)_i action or from a previous round.

Invariant 6.2.16 In any state of an execution α of S_BPX, for any r ∈ R_V we have
that Hvalue(r) ∈ V_α.

Proof: We proceed by induction on the length k of the execution α. The base of the
induction is trivially true: for k = 0 we have that α = s_0 and in the initial state all
the Hvalue variables are undefined.

Assume that the assertion is true for α = s_0 π_1 s_1 ... π_k s_k and consider the execution
s_0 π_1 s_1 ... π_k s_k π s. We need to prove that, for every possible action π, the assertion is
still true in state s. Clearly the only actions that can make the assertion false are
those that modify Hvalue. The only action that modifies Hvalue is BeginCast. Thus,
assume π = BeginCast_i. This action sets Hvalue(r) to RndValue_i. We need to prove
that all the values assigned to RndValue_i are in the set V_α. Variable RndValue_i is
modified by actions NewRound_i and GatherLast_i. We can easily take care of action
NewRound_i because it simply sets RndValue_i to be InitValue_i, which is obviously in
V_α. Thus we only need to worry about GatherLast_i actions. A GatherLast_i action
sets variable RndValue_i to the value specified in the "Last" message if that value
is not nil. By the code, it is easy to see that the value specified in any "Last"
message is either nil or the value Hvalue(r') of a previous round r'; by the inductive
hypothesis we have that Hvalue(r') belongs to V_α. ∎

Invariant 6.2.17 In any state of an execution α of S_BPX, all the Decision variables
that are not undefined are set to some value in V_α.

Proof: A variable Decision is always set to be equal to Hvalue(r) for some r. Thus
the invariant follows from Invariant 6.2.16. ∎

Theorem 6.2.18 In any execution of the system S_BPX, validity is satisfied.

Proof: Immediate from Invariant 6.2.17. ∎

6.2.4 Analysis 

In this section we analyze the performance of S_BPX. Since the algorithm may not
terminate at all when failures happen, we can only prove that if, starting from some
point in time on, no failures or recoveries happen and there is at least a majority of
alive processes then termination is achieved within some time bound and with the
sending of some number of messages.

Before turning our attention to the time analysis, let us give the following lemma
which provides a bound on the number of messages sent in any round.

Lemma 6.2.19 If an execution fragment of the system S_BPX, starting in a reachable
state, is stable then at most 4n messages are sent in a round.

Proof: In step 1 the leader broadcasts a "Collect" message, thus this counts for n
messages. Since the execution is stable, no message is duplicated. In step 2, agents
respond to the "Collect" message. Even though only ⌊n/2⌋ + 1 of these responses are
used by the leader, we need to account for n messages since every process may send
a "Last" message in step 2. A similar reasoning for steps 3 and 4 leads to a total of
at most 4n messages. ∎
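
The counting argument of the proof can be spelled out as a short calculation. The Python sketch below is only an illustration of that argument; the function names are assumptions and nothing here is part of the automaton code.

```python
def quorum_size(n: int) -> int:
    """Smallest majority of n processes: floor(n/2) + 1."""
    return n // 2 + 1


def max_messages_per_round(n: int) -> int:
    """Upper bound on messages in one round of a stable execution.

    Each of the four steps is charged n messages: a broadcast by the
    leader, or at most one response from each of the n processes.
    """
    per_step = {"Collect": n, "Last": n, "Begin": n, "Accept": n}
    return sum(per_step.values())
```

For n = 5 the leader needs only quorum_size(5) = 3 responses in steps 2 and 4, but the bound still charges 5 messages per step, for a total of 20.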

Now we consider the time analysis. Let us begin by making precise the meaning 
of expressions like "the start (end) of a round". 

Definition 6.2.20 In an execution fragment during which process i is the unique
leader

• the "start" of a round is the execution of action NewRound_i;

• the "end" of a round is the execution of action RndSuccess_i.

A round is successful if it ends, that is, if the RndSuccess_i action is executed by
the leader i. Moreover we say that a process i reaches its decision when automaton
BPSUCCESS_i sets its Decision variable. We remark that, in the case of a leader, the
decision is actually reached when the leader knows that a majority of the processes
have accepted the value being proposed. This happens in action GatherAccept_i of
BPLEADER_i. However, to be precise in our proofs, we consider the decision reached
when the variable Decision of BPSUCCESS_i is set; for the leader this happens exactly at
the end of a successful round. Notice that the Decide(v)_i action, which communicates
the decision v of process i to the external environment, is executed within ℓ time from
the point in time when process i reaches the decision, provided that the execution is
regular (in a regular execution actions are executed within the expected time bounds).

The following lemma states that, once the leader has made a decision, if the
execution is stable, the decision will be reached by all the alive processes within linear
(in the number of processes) time and with the sending of at most 2n messages.

Lemma 6.2.21 If an execution fragment α of the system S_BPX, starting in a reachable
state s and lasting for more than 3ℓ + 2nℓ + 2d time, is stable and there is a
unique leader, say i, that has reached a decision in state s, then by time 3ℓ + 2nℓ + 2d,
every alive process j ≠ i has reached a decision, and the leader i has Acked(j)_i = true
for every j ≠ i. Furthermore, at most 2n messages are sent.

Proof: First notice that S_BPX is the composition of CHANNEL_{i,j} and other automata.
Hence, by Theorem 2.6.10 we can apply Lemma 3.2.3. Let i be the leader. By assumption,
Decision_i of BPSUCCESS_i is not nil in state s. By the code of BPSUCCESS_i,
action SendSuccess_i is executed within ℓ time. This action puts at most n messages
into the OutMsgs_i set. Action Send_{i,j} is enabled until all of them have been actually
sent over the channels. This takes at most nℓ time. By Lemma 3.2.3 each alive process
j receives the "Success" message, i.e., executes a Receive("Success", v)_{i,j} action,
within d time. By Lemma 2.5.4, action GatherSuccess_j will be executed within additional
ℓ time. This action sets variable Decision_j and puts an "Ack" message into
OutMsgs_j. At this point all alive processes have reached a decision. Within ℓ time
the "Ack" message is actually sent over CHANNEL_{j,i}. Then, again by Lemma 3.2.3
this "Ack" message is received by process i, i.e., action Receive("Ack")_{j,i} is executed,
within d time. Within at most nℓ time all "Ack" messages are processed by action
Ack_i. At this point the leader knows that all alive processes have reached a decision,
and will not send any other message to them. The time bound is obtained by adding
the above time bounds. We account for 2n messages since the leader sends a "Success"
message to every process and for each of these messages an acknowledgment is
sent. ■
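
The time bound in the proof above is simply the sum of the per-step bounds. As a
minimal sketch (the step labels and the coefficient representation are ours and do not
appear in the formal model), each bound can be written as a triple of coefficients for
ℓ, nℓ and d and the triples added component-wise:

```python
# Each bound is a triple (a, b, c) standing for a*l + b*n*l + c*d, where l is the
# local step bound, n the number of processes and d the message delivery bound.
# The step labels are informal names for the steps of the proof of Lemma 6.2.21.
STEPS = [
    ("SendSuccess executed by the leader", (1, 0, 0)),
    ("all 'Success' messages sent",        (0, 1, 0)),
    ("'Success' delivered",                (0, 0, 1)),
    ("GatherSuccess executed by agent",    (1, 0, 0)),
    ("'Ack' sent by agent",                (1, 0, 0)),
    ("'Ack' delivered",                    (0, 0, 1)),
    ("all 'Ack' messages processed",       (0, 1, 0)),
]


def total(steps):
    """Add the coefficient triples component-wise."""
    return tuple(sum(column) for column in zip(*(bound for _, bound in steps)))


def evaluate(bound, n, l, d):
    """Turn a coefficient triple into a concrete time for given n, l and d."""
    a, b, c = bound
    return a * l + b * n * l + c * d


if __name__ == "__main__":
    print(total(STEPS))                       # coefficients of l, n*l and d
    print(evaluate(total(STEPS), 5, 1.0, 10.0))
```

The message count needs no such bookkeeping: one "Success" and one "Ack" per process
gives the 2n of the lemma.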

In the following we will be interested in the time analysis from the start to the end
of a successful round. Hence we consider an execution fragment α having a unique
leader, say process i, and such that the leader i has started a round by the first state
of α (that is, in the first state of α, CurRnd_i = r for some round number r).

We remark that in order for the leader to execute step 3, i.e., action BeginCast_i,
it is necessary that RndValue be defined. If the leader does not have an initial value
and no agent sends a value in a "Last" message, variable RndValue is not defined. In
this case the leader needs to wait for the execution of the Init(v)_i action to set a value to
propose in the round (see action Continue_i). Clearly the time analysis depends on the
time of occurrence of the Init(v)_i action. To deal with this we use the following definition.

Definition 6.2.22 Given an execution fragment α, we define t^i_α to be

• 0, if InitValue_i is defined in the first state of α;

• the time of occurrence of action Init(v)_i, if variable InitValue_i is undefined in
the first state of α and action Init(v)_i is executed in α;

• infinite, if variable InitValue_i is undefined in the first state of α and no Init(v)_i
action is executed in α.

Moreover, we define T^i_α to be max{4ℓ + 2nℓ + 2d, t^i_α + 2ℓ}.
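
To make the definition concrete, the following sketch computes the two quantities.
It is an illustration under our own assumptions: the function names are hypothetical,
and we read the first case of the definition as giving the value 0 when InitValue_i is
already defined at the start of the fragment.

```python
import math


def t_init(init_defined_at_start, init_time=None):
    """t^i_alpha: 0 if InitValue_i is already defined (our reading of the first
    case), the time of the Init(v)_i action if it occurs, infinity otherwise."""
    if init_defined_at_start:
        return 0.0
    if init_time is not None:
        return float(init_time)
    return math.inf


def big_t(t, n, l, d):
    """T^i_alpha = max{4l + 2nl + 2d, t^i_alpha + 2l}."""
    return max(4 * l + 2 * n * l + 2 * d, t + 2 * l)


if __name__ == "__main__":
    n, l, d = 3, 1.0, 5.0
    print(big_t(t_init(True), n, l, d))          # initial value known from the start
    print(big_t(t_init(False, 30.0), n, l, d))   # Init(v)_i arrives late
    print(big_t(t_init(False), n, l, d))         # Init(v)_i never occurs
```

The first term of the maximum dominates whenever the initial value arrives early
enough that the leader is still busy with steps 1 and 2 of the round.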

We are now ready to provide a time analysis for a successful round. We first
provide a simple lemma that gives a bound for the time that elapses between the
execution of the BeginCast action and the RndSuccess action for a successful round
in a stable execution fragment. Notice that action BeginCast for a round r sets history
variable Hvalue(r); hence the fact that in a particular reachable state s we have that
s.Hvalue(r) ≠ nil means that for any execution that brings the system into state s
action BeginCast for round r has been executed.

Lemma 6.2.23 Suppose that for an execution fragment α of the system S_BPX, starting
in a reachable state s in which s.Decision = nil, it holds that:

(i) α is stable;

(ii) in α there exists a unique leader, say process i;

(iii) α lasts for more than 3ℓ + 2nℓ + 2d time;

(iv) s.CurRnd_i = r, for some round number r, and s.Hvalue(r) ≠ nil;

(v) round r is successful.

Then we have that action RndSuccess_i is performed by time 3ℓ + 2nℓ + 2d from the
beginning of α.

Proof: First notice that S_BPX is the composition of CHANNEL_{i,j} and other automata.
Hence, by Theorem 2.6.10 we can apply Lemmas 2.5.4 and 3.2.3. Since the execution
is stable, it is also regular, and thus by Lemma 2.5.4 actions of BPLEADER_i and
BPAGENT_i are executed within ℓ time and by Lemma 3.2.3 messages are delivered
within d time.

Variable Hvalue(r) is set when action BeginCast for round r is executed. Since
Hvalue(r) is defined, "Begin" messages for round r have been put in OutMsgs_i. In at
most nℓ time action Send_{i,j} is executed for each of these messages, and the "Begin"
message is delivered to each agent j, i.e., action Receive_{i,j} is executed, within d time.
Then, the agent executes action Accept_j within ℓ time. This action puts the "Accept"
message in OutMsgs_j. Action Send_{j,i} for this message is executed within ℓ time and
the message is delivered, i.e., action Receive_{j,i} for that message is executed, within d
time. Since the round is successful there are more than n/2 such messages received by
the leader. To set the decision, action GatherAccept_i must be executed for ⌊n/2⌋ + 1
"Accept" messages. This is done in less than nℓ time. At this point the Decision
variable is defined and action RndSuccess_i is executed within ℓ time. Summing up
all the times we have that the round ends within 3ℓ + 2nℓ + 2d. ■

The next lemma provides a bound on the time needed to complete a successful 
round in a stable execution fragment. 

Lemma 6.2.24 Suppose that for an execution fragment α of the system S_BPX, starting
in a reachable state s in which s.Decision = nil, it holds that:

(i) α is stable;

(ii) in α there exists a unique leader, say process i;

(iii) α lasts for more than T^i_α + 3ℓ + 2nℓ + 2d time;

(iv) s.CurRnd_i = r, for some round number r;

(v) round r is successful.

Then we have that action RndSuccess_i is performed by time T^i_α + 3ℓ + 2nℓ + 2d from
the beginning of α.

Proof: First notice that S_BPX is the composition of CHANNEL_{i,j} and other automata.
Hence, by Theorem 2.6.10 we can apply Lemmas 2.5.4 and 3.2.3. Since the execution
is stable, it is also regular, and thus by Lemma 2.5.4 actions of BPLEADER_i and
BPAGENT_i are executed within ℓ time and by Lemma 3.2.3 messages are delivered
within d time.

To prove the lemma, we distinguish two possible cases.

CASE 1. s.Hvalue(r) ≠ nil.
By Lemma 6.2.23 action RndSuccess_i is executed within 3ℓ + 2nℓ + 2d time from the
beginning of α.

CASE 2. s.Hvalue(r) = nil. We first prove that action BeginCast_i is executed
by time T^i_α from the beginning of α.

Since s.CurRnd_i = r, it takes at most ℓ time for the leader to execute action Collect_i.
This action puts n "Collect" messages, one for each agent j, into OutMsgs_i. In at
most nℓ time action Send_{i,j} is executed for each of these messages, and the "Collect"
message is delivered to each agent j, i.e., action Receive_{i,j} is executed, within d time.
Then it takes ℓ time for an agent to execute action LastAccept_j, which puts the
"Last" message in OutMsgs_j, and ℓ time to execute action Send_{j,i} for that message.
The "Last" message is delivered to the leader, i.e., action Receive_{j,i} is executed,
within d time. Since the round is successful at least a majority of the processes send
back to the leader a "Last" message in response to the "Collect" message. Action
GatherLast_i, which handles "Last" messages, is executed for ⌊n/2⌋ + 1 messages; this
is done within at most nℓ time.

At this point there are two possible cases: (i) RndValue is defined and (ii)
RndValue is not defined. In case (i), action BeginCast_i is enabled and is executed
within ℓ time. Summing up the times considered so far we have that action BeginCast_i
is executed within 4ℓ + 2nℓ + 2d time from the start of the round. In case (ii), action
Continue_i is executed within t^i_α + ℓ time; this action enables action BeginCast_i, which
is executed within additional ℓ time. Hence action BeginCast_i is executed by time
t^i_α + 2ℓ. Putting together the two cases we have that action BeginCast_i is executed
by time max{4ℓ + 2nℓ + 2d, t^i_α + 2ℓ}.

Hence we have proved that action BeginCast_i is executed in α by time T^i_α.

Let α' be the fragment of α starting after the execution of the BeginCast_i action.
By Lemma 6.2.23 action RndSuccess_i is executed within 3ℓ + 2nℓ + 2d time from the
beginning of α'. Since action BeginCast_i is executed by time T^i_α in α we have that
action RndSuccess_i is executed by time T^i_α + 3ℓ + 2nℓ + 2d in α. ■

Lemmas 6.2.19, 6.2.21 and 6.2.24 state that if in a stable execution a successful
round is conducted, then it takes time linear in n and a number of messages linear
in n to reach consensus. However it is possible that even if the
system executes nicely from some point in time on, no successful round is conducted 
and to have a successful round a new round must be started. We take care of this 
problem in the next section. We will use a more refined version of Lemma 6.2.24; this 
refined version replaces condition (v) of Lemma 6.2.24 with a weaker requirement. 
This weaker requirement is enough to prove that the round is successful. 

Lemma 6.2.25 Suppose that for an execution fragment α of S_BPX, starting in a
reachable state s in which s.Decision = nil, it holds that:

(i) α is nice;

(ii) in α there exists a unique leader, say process i;

(iii) α lasts for more than T^i_α + 3ℓ + 2nℓ + 2d time;

(iv) s.CurRnd_i = r, for some round number r;

(v) there exists a set J ⊆ ℐ of processes such that every process in J is alive and
J is a majority, for every j ∈ J, s.Commit_j < r and for every j ∈ J and
k ∈ ℐ, CHANNEL_{k,j} does not contain any "Collect" message belonging to any
round r' > r.

Then we have that action RndSuccess_i is performed by time T^i_α + 3ℓ + 2nℓ + 2d from
the beginning of α.

Proof: In state s, process i is the unique leader in α and since s.CurRnd_i = r, round
r has been started by i. Hence process i sends a "Collect" message which is delivered
to all the alive voters. All the alive voters, and thus all the processes in J, respond
with "Last" messages which are delivered to the leader. No process j ∈ J can be
committed to reject round r. Indeed, by assumption, process j is not committed to
reject round r in state s; moreover process j cannot receive a "Collect" message that
forces it to commit to reject round r since, by assumption, no such message is in
any channel to process j in s and in α the only leader is i, which only sends messages
belonging to round r. Since J is a majority, the leader receives at least a majority
of "Last" messages and thus it is able to proceed with the next step of the round.
The leader sends a "Begin" message which is delivered to all the alive voters. All the
alive voters, and thus all the processes in J, respond with "Accept" messages since
they are not committed to reject round r. Since J is a majority, the leader receives
at least a majority of "Accept" messages. Therefore round r is successful. Thus we
can apply Lemma 6.2.24. By Lemma 6.2.24 action RndSuccess_i is performed within
T^i_α + 3ℓ + 2nℓ + 2d time. ■

6.3 Automaton STARTERALG 

To reach consensus using S_BPX, rounds must be started by an external agent by
means of the NewRound_i action that makes process i start a new round. The system
S_BPX guarantees that running rounds does not violate agreement and validity, even
if rounds are started by many processes. However, since running a new round may
prevent a previous one from succeeding, initiating too many rounds is not a good idea.
The strategy used to initiate rounds is to have a leader election algorithm and let the
leader initiate new rounds until one round is successful. We exploit the robustness of
BASICPAXOS in order to use the sloppy leader elector provided in Chapter 5. As long
as the leader elector does not provide exactly one leader, it is possible that no round
is successful; however, agreement and validity are always guaranteed. Moreover, when
the leader elector provides exactly one leader, if the system S_BPX is executing a nice
execution fragment³ then a round is successful.

Once a process is leader, it must start rounds until one of them is successful or until 
it is no longer leader. When a process i becomes leader it starts a round. However 
due to crashes of other processes or due to already started rounds, the round started 
by i may not succeed. In this case the leader must start a new round. 

Figure 6-11 shows a Clock GT automaton STARTERALG_i for process i. This automaton
interacts with LEADERELECTOR_i by means of the Leader_i and NotLeader_i
actions and with BASICPAXOS_i by means of the NewRound_i, BeginCast_i and RndSuccess_i
actions. Figure 6-1, given at the beginning of the chapter, shows the interaction of
the STARTERALG_i automaton with the other automata.

Automaton STARTERALG_i updates the flag IamLeader according to the input actions
Leader_i and NotLeader_i and executes the other actions whenever it is the leader.
Flag Start is used to start a new round and it is set either when a Leader_i action
changes the leader status IamLeader from false to true, that is, when the process
becomes leader, or when action RndSuccess_i is not executed within the expected
time bound. Flag RndSuccess is updated by the input action RndSuccess_i. Action
NewRound_i starts a new round. Action CheckRndSuccess_i checks whether the
round is successful within the expected time bound. This time bound depends on
whether the leader has to wait for an Init(v)_i event. However by Lemma 6.2.23 action
RndSuccess_i is expected to be executed within 3ℓ + 2nℓ + 2d time from the time of
occurrence of action BeginCast_i. When action BeginCast_i is executed, the above time
bound is set. Action CheckRndSuccess_i starts a new round if the previous one does
not succeed within the expected time bound.
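
As an illustration only, the behaviour of STARTERALG_i described above can be mimicked
by a small Python class. This is a sketch under our own assumptions and not part of
the formal model: the class and method names are hypothetical, and time passage is
modelled by letting the clock advance no further than min(Last, LastNR).

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class StarterAlg:
    n: int                            # number of processes
    l: float                          # bound on local step time
    d: float                          # bound on message delivery time
    clock: float = 0.0
    alive: bool = True                # Status = alive
    i_am_leader: bool = False
    start: bool = False
    deadline: Optional[float] = None  # None plays the role of nil
    last_nr: float = math.inf
    last: float = math.inf
    rnd_success: bool = False

    # ---- input actions ----
    def on_stop(self):
        self.alive = False

    def on_recover(self):
        self.alive = True

    def on_leader(self):
        if self.alive and not self.i_am_leader:
            self.i_am_leader = True
            if not self.rnd_success:
                self.deadline = None
                self.start = True
                self.last_nr = self.clock + self.l

    def on_not_leader(self):
        if self.alive:
            self.i_am_leader = False

    def on_begin_cast(self):
        if self.alive:
            self.deadline = (self.clock + 3 * self.l
                             + 2 * self.n * self.l + 2 * self.d)

    def on_rnd_success(self):
        if self.alive:
            self.rnd_success = True
            self.last = math.inf

    # ---- output action NewRound ----
    def new_round_enabled(self):
        return self.alive and self.i_am_leader and self.start

    def new_round(self):
        assert self.new_round_enabled()
        self.start = False
        self.last_nr = math.inf

    # ---- internal action CheckRndSuccess ----
    def check_enabled(self):
        return (self.alive and self.i_am_leader
                and self.deadline is not None and self.clock > self.deadline)

    def check_rnd_success(self):
        assert self.check_enabled()
        self.last = math.inf
        if not self.rnd_success:
            self.start = True
            self.last_nr = self.clock + self.l

    # ---- time passage: the clock may not pass Last or LastNR ----
    def advance(self, t):
        if self.alive:
            self.clock = min(self.clock + t, self.last, self.last_nr)
```

In this sketch a driver would call `on_leader()`, then `new_round()` once it is enabled,
`on_begin_cast()` when the round reaches step 3, and `check_rnd_success()` after the
deadline has passed without an `on_rnd_success()`.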



³ Recall that in a nice execution fragment there are no failures or recoveries and a majority of the
processes are alive. See definition at the end of Chapter 3.

STARTERALG_i

Signature:
  Input:        Leader_i, NotLeader_i, BeginCast_i, RndSuccess_i, Stop_i, Recover_i
  Internal:     CheckRndSuccess_i
  Output:       NewRound_i
  Time-passage: ν(t)

States:
  Clock ∈ ℝ                      initially arbitrary
  Status ∈ {alive, stopped}      initially alive
  IamLeader, a boolean           initially false
  Start, a boolean               initially false
  Deadline ∈ ℝ ∪ {nil}           initially nil
  LastNR ∈ ℝ ∪ {∞}               initially ∞
  Last ∈ ℝ ∪ {∞}                 initially ∞
  RndSuccess, a boolean          initially false

Actions:

  input Stop_i
    Eff: Status := stopped

  input Recover_i
    Eff: Status := alive

  input Leader_i
    Eff: if Status = alive then
           if IamLeader = false then
             IamLeader := true
             if RndSuccess = false then
               Deadline := nil
               Start := true
               LastNR := Clock + ℓ

  input NotLeader_i
    Eff: if Status = alive then
           IamLeader := false

  input BeginCast_i
    Eff: if Status = alive then
           Deadline := Clock + 3ℓ + 2nℓ + 2d

  input RndSuccess(v)_i
    Eff: if Status = alive then
           RndSuccess := true
           Last := ∞

  output NewRound_i
    Pre: Status = alive
         IamLeader = true
         Start = true
    Eff: Start := false
         LastNR := ∞

  internal CheckRndSuccess_i
    Pre: Status = alive
         IamLeader = true
         Deadline ≠ nil
         Clock > Deadline
    Eff: Last := ∞
         if RndSuccess = false then
           Start := true
           LastNR := Clock + ℓ

  time-passage ν(t)
    Pre: Status = alive
    Eff: Let t' be s.t. Clock + t' ≤ Last
           and Clock + t' ≤ LastNR
         Clock := Clock + t'

Figure 6-11: Automaton STARTERALG for process i

6.4 Correctness and analysis

Even in a nice execution fragment a round may not reach success. However in that
case a new round is started and there is nothing that can prevent the success of the
new round. Indeed in the newly started round, alive processes are not committed for
higher numbered rounds since during the first round they inform the leader of the
round number for which they are committed and the leader, when starting a new
round, always uses a round number greater than any round number ever seen. Thus
in the newly started round, alive processes are not committed for higher numbered
rounds and since the execution is nice the round is successful. In this section we will
formally prove the above statements.

Let S_PAX be the system obtained by composing system S_LEA with one automaton
BASICPAXOS_i and one automaton STARTERALG_i for each process i ∈ ℐ. Since
this system contains as a subsystem the system S_BPX, it guarantees agreement and
validity. Moreover, in a long enough nice execution fragment of S_PAX termination is
achieved, too.

The following lemma states that in a long enough nice execution fragment with
a unique leader, the leader reaches a decision. We recall that
T^i_α = max{4ℓ + 2nℓ + 2d, t^i_α + 2ℓ} and that t^i_α is the time of occurrence of
action Init(v)_i in α (see Definition 6.2.22).

Lemma 6.4.1 Suppose that for an execution fragment α of S_PAX, starting in a reachable
state s in which s.Decision = nil, it holds that

(i) α is nice;

(ii) there is a unique leader, say process i;

(iii) α lasts for more than T^i_α + 12ℓ + 6nℓ + 7d time.

Then by time T^i_α + 12ℓ + 6nℓ + 7d the leader i has reached a decision.

Proof: First we notice that system S_PAX contains as subsystem S_BPX; hence by using
Theorem 2.6.10, the projection of α on the subsystem S_BPX is actually an execution
of S_BPX and thus Lemmas 6.2.24 and 6.2.25 are still true in α.

Let s' be the first state of α such that all the messages that are in the channels in
state s are no longer in the channels in state s' and such that s'.CurRnd is defined.

State s' exists in α and its time of occurrence is less than or equal to max{d, ℓ}. Indeed,
since the execution is nice, all the messages that are in the channels in state s are
delivered by time d and if CurRnd is not defined in state s then, by the code of
STARTERALG_i, since i is leader in α, action NewRound_i is executed by time ℓ from the
beginning of α.

In state s', for every alive process j and for every k, CHANNEL_{k,j} does not contain
any "Collect" message belonging to any round not started by process i. Indeed, since
i is the unique leader in α, "Collect" messages sent during α are sent by process i and
other "Collect" messages possibly present in the channels in state s are no longer
in the channels in state s'.

Let r be the number of the latest round started by process i by state s', that is,
s'.CurRnd_i = r.

Let α' be the fragment of α beginning at s'. Since α' is a fragment of α, we have
that α' is nice and process i is the unique leader in α'.

We now distinguish two possible cases.

CASE 1. Round r is successful. In this case, by Lemma 6.2.24 the round is
successful within T^i_{α'} + 3ℓ + 2nℓ + 2d time in α'. Noticing that T^i_{α'} ≤ T^i_α and that
max{d, ℓ} ≤ d + ℓ, we have that the round is successful within T^i_α + 4ℓ + 2nℓ + 3d
time in α. Thus the lemma is true in this case.

CASE 2. Round r is not successful.

By the code of STARTERALG_i, action NewRound_i is executed within T^i_{α'} + 4ℓ +
2nℓ + 2d time in α' (it takes T^i_{α'} + 3ℓ + 2nℓ + 2d to execute action CheckRndSuccess_i and
additional ℓ time to execute action NewRound_i). Let r_new be the new round started
by i in action NewRound_i, let s'' be the state of the system after the execution of
action NewRound_i and let α'' be the fragment of α' beginning at s''.

Clearly α'' is nice, process i is the unique leader in α'' and s''.CurRnd_i = r_new.

Any alive process j that rejected round r because of a round r', r' > r, has
responded to the "Collect" message of round r with a message (r, "OldRound", r')_{j,i}
informing the leader i about round r'. Since α' is nice all the "OldRound" messages
are received before state s''. Since action NewRound_i uses a round number greater
than all the ones received in "OldRound" messages, we have that for any alive process
j, s''.Commit_j < r_new.

Let J be the set of alive processes. From what is argued above any process
j ∈ J has s''.Commit_j < r_new. Moreover in state s'', for every j ∈ J and any k ∈ ℐ,
CHANNEL_{k,j} does not contain any "Collect" message belonging to any round r' > r_new
(indeed "Collect" messages sent in α are sent only by the unique leader i and we have
already argued that any other "Collect" message is delivered before state s'). Finally
since α is nice, by definition of nice execution fragment, we have that J contains a
majority of the processes.

Hence we can apply Lemma 6.2.25 to the execution fragment α''. Moreover for
α'' we have that T^i_{α''} = 4ℓ + 2nℓ + 2d (indeed we assumed that round r is not
successful and this can only happen when an initial value has been provided). Hence by
Lemma 6.2.25, round r_new is successful within 7ℓ + 4nℓ + 4d time from the beginning
of α''. Summing up the time bounds and using max{d, ℓ} ≤ d + ℓ and T^i_{α'} ≤ T^i_α, we
have that the lemma is true also in this case. ■
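
For reference, the sum for Case 2 can be written out explicitly. The three terms are
the time to reach s', the time to execute NewRound_i in α', and the time for round
r_new to succeed in α''; the bounds max{d, ℓ} ≤ d + ℓ and T^i_{α'} ≤ T^i_α are used
for the first two terms:

```latex
\begin{align*}
\underbrace{(d + \ell)}_{\text{reach } s'}
 + \underbrace{\bigl(T^i_{\alpha} + 4\ell + 2n\ell + 2d\bigr)}_{\textit{NewRound}_i \text{ in } \alpha'}
 + \underbrace{\bigl(7\ell + 4n\ell + 4d\bigr)}_{r_{new} \text{ succeeds in } \alpha''}
 &= T^i_{\alpha} + 12\ell + 6n\ell + 7d .
\end{align*}
```

The bound obtained in Case 1, T^i_α + 4ℓ + 2nℓ + 3d, is smaller, so the Case 2 bound
covers both cases.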

If the execution is stable for enough time, then the leader election eventually elects
a unique leader. In the following theorem we consider a nice execution fragment α
and we let i be the process eventually elected unique leader. We remark that before i
is elected leader several processes may consider themselves leaders. Hence, as a
worst-case scenario, we have that before i becomes the unique leader, all the processes may
act as leaders and may send messages. In the message analysis we do not count
any message m sent before i becomes the unique leader and also we do not count a
response to such a message m (in the worst-case scenario, these messages can be as
many as O(n²)). We also recall that, for any i, t^i_α denotes the time of occurrence of
action Init(v)_i if this action occurs in α (see Definition 6.2.22).

Theorem 6.4.2 Let α be a nice execution fragment of S_PAX starting in a reachable
state and lasting for more than t^i_α + 24ℓ + 10nℓ + 13d. Then the leader i executes
Decide(v')_i by time t^i_α + 21ℓ + 8nℓ + 11d from the beginning of α and at most 8n
messages are sent. Moreover by time t^i_α + 24ℓ + 10nℓ + 13d from the beginning of α
any alive process j executes Decide(v')_j and at most 2n additional messages are sent.

Proof: Since S_PAX contains S_LEA and S_BPX as subsystems, by Theorem 2.6.10 we
can use any property of S_LEA and S_BPX. Since the execution fragment is nice (and
thus stable), by Lemma 5.2.2 there will be a unique leader (process i) by time 4ℓ + 2d.
Let s' be the first state of α in which there is a unique leader. By Lemma 5.2.2 the
time of occurrence of s' is before or at time 4ℓ + 2d. Let α' be the fragment of α starting
in state s'. Since α is nice, α' is nice.

By Lemma 6.4.1 we have that the leader reaches a decision by time T^i_{α'} + 12ℓ +
6nℓ + 7d from the beginning of α'. Summing up the times and noticing that T^i_{α'} ≤
t^i_{α'} + 4ℓ + 2nℓ + 2d and that t^i_{α'} ≤ t^i_α we have that the leader reaches a decision by
time t^i_α + 20ℓ + 8nℓ + 11d. Within additional ℓ time action Decide(v')_i is executed.
Moreover during α' the leader starts at most two rounds and by Lemma 6.2.19 we
have that at most 4n messages are sent in each round.

Since the leader reaches a decision by time t^i_α + 20ℓ + 8nℓ + 11d, by Lemma 6.2.21
we have that a decision is reached by every alive process j by time t^i_α + 23ℓ + 10nℓ + 13d
with the sending of at most 2n additional messages. Within additional ℓ time action
Decide(v')_j is executed. ■
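
To get a feeling for the bounds of Theorem 6.4.2, they can be evaluated for concrete
parameters. The sketch below is illustrative only: the function names are ours, and the
second function re-derives the leader's bound from the pieces used in the proof (leader
election, the bound on T, Lemma 6.4.1, and one extra step for Decide).

```python
def theorem_bounds(t, n, l, d):
    """Bounds stated in Theorem 6.4.2 for given t = t^i_alpha, n, l and d."""
    return {
        "leader_decides_by": t + 21 * l + 8 * n * l + 11 * d,
        "all_decide_by": t + 24 * l + 10 * n * l + 13 * d,
        "messages_until_leader_decides": 8 * n,   # two rounds, 4n messages each
        "messages_total": 8 * n + 2 * n,          # plus "Success" and "Ack"
    }


def leader_bound_from_pieces(t, n, l, d):
    """Same bound for the leader, assembled as in the proof."""
    election = 4 * l + 2 * d                      # unique leader (Lemma 5.2.2)
    big_t = t + 4 * l + 2 * n * l + 2 * d         # upper bound on T
    lemma_6_4_1 = 12 * l + 6 * n * l + 7 * d      # added to T in Lemma 6.4.1
    decide = l                                    # Decide action
    return election + big_t + lemma_6_4_1 + decide


if __name__ == "__main__":
    # Example: 5 processes, unit step time, delivery time 10, value known at start.
    print(theorem_bounds(0, 5, 1, 10))
    print(leader_bound_from_pieces(0, 5, 1, 10))
```

With these example values the delivery delay d dominates, which matches the intuition
that a successful run costs a small, fixed number of message exchanges.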

6.5 Concluding remarks 

In this chapter we have provided a new presentation of the PAXOS algorithm. The 
PAXOS algorithm was devised in [29]. However, the algorithm seems to be not widely 
known or understood. We conclude this chapter with a few remarks. 

The first remark concerns the time analysis. The linear factor in the time bounds
derives from the fact that a leader needs to broadcast n messages (one for each agent)
and also has to handle up to n responses that may arrive concurrently. If we assume
that the broadcasting of a message to n processes takes constant time, and that
incoming messages can be processed within constant time from their receipt, then
all the nℓ contributions in the time bounds become ℓ, and the time bounds become
constants instead of linear functions of the number of processes.

Another remark is about the use of majorities for info-quorums and accepting- 
quorums. The only property that is used is that there exists at least one process 
common to any info-quorum and any accepting-quorum. Thus any quorum scheme 
for info-quorums and accepting-quorums that guarantees the above property can be 
used. 
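
The intersection property can be checked mechanically for a small system. The sketch
below is an illustration with hypothetical helper names: it enumerates the minimal
majorities of a process set and tests whether every info-quorum meets every
accepting-quorum. The 2 x 2 grid scheme (rows as info-quorums, columns as
accepting-quorums) is our own example of a non-majority scheme with the same property.

```python
from itertools import combinations


def majorities(processes):
    """All subsets of size floor(n/2) + 1 (the minimal majorities)."""
    k = len(processes) // 2 + 1
    return [frozenset(c) for c in combinations(processes, k)]


def quorums_intersect(info_quorums, accepting_quorums):
    """True if every info-quorum shares a process with every accepting-quorum."""
    return all(q1 & q2 for q1 in info_quorums for q2 in accepting_quorums)


if __name__ == "__main__":
    procs = list(range(5))
    m = majorities(procs)
    print(quorums_intersect(m, m))          # majorities always intersect

    rows = [frozenset({0, 1}), frozenset({2, 3})]
    cols = [frozenset({0, 2}), frozenset({1, 3})]
    print(quorums_intersect(rows, cols))    # every row meets every column
    print(quorums_intersect(rows, rows))    # two different rows are disjoint
```

Checking only minimal majorities is enough, since any larger set contains one of them
and therefore inherits the intersection.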

The amount of stable storage needed can be reduced to very few state variables.
These are the last round started by a leader (which is stored in the CurRnd variable), 
the last round in which an agent accepted the value and the value of that round 
(variables LastR, LastV), and the round for which an agent is committed (variable 
Commit). These variables are used to keep consistency, that is, to always propose 
values that are consistent with previously proposed values, so if they are lost then 
consistency might not be preserved. In our setting we assumed that the entire state of 
the processes is in stable storage, but in a practical implementation only the variables 
described above need to be stable. 
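
A practical implementation might persist just these variables. The following is a
minimal sketch under our own assumptions: the record layout, the file format (JSON)
and the function names are hypothetical, and round numbers are simplified to integers.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class StableState:
    cur_rnd: Optional[int] = None   # CurRnd: last round started as leader
    last_r: Optional[int] = None    # LastR: last round in which a value was accepted
    last_v: Optional[str] = None    # LastV: the value accepted in that round
    commit: Optional[int] = None    # Commit: round the agent is committed to


def save(state, path):
    """Write the record atomically: temp file in the same directory, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(state), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load(path):
    """Read the record back; a missing file means nothing was ever stored."""
    if not os.path.exists(path):
        return StableState()
    with open(path) as f:
        return StableState(**json.load(f))
```

A process would call `save` before sending any message whose content depends on one
of these variables, and `load` upon recovery; all other state can be rebuilt.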

We remark that a practical implementation of PAXOS should cope with some 
failures before abandoning a round. For example a message could be sent twice, 
since duplication is not a problem for the algorithm (it may only affect the message 
analysis), or the time bound checking may be done later than the earliest possible 
time to allow some delay in the delivery of messages. 

A recovery may cause a delay. Indeed if the recovered process has a bigger identifier
than that of the leader then it will become the leader and will start new rounds,
possibly preventing the old round from succeeding. As suggested in Lamport's original
paper, one could use a different leader election strategy which keeps a leader as long
as it does not fail. However it is not clear to us how to design such a strategy.

Chapter 7



The MULTIPAXOS algorithm 



The PAXOS algorithm allows processes to reach consensus on one value. We consider 
now the situation in which consensus has to be reached on a sequence of values; more 
precisely, for each integer k, processes need to reach consensus on the k-th value. The 
MULTIPAXOS algorithm reaches consensus on a sequence of values; it was discovered 
by Lamport at the same time as PAXOS [29]. 

7.1 Overview 

To achieve consensus on a sequence of values we can informally use an instance of 
PAXOS for each integer k, so that the k-th instance is used to agree on the k-th value. 
Since we need an instance of PAXOS to agree on the k-th value, we need for each 
integer k an instance of the BASICPAXOS and STARTERALG automata. To distinguish 
instances we use an additional parameter that specifies the ordinal number of the 
instance. So, we have BASICPAXOS(1), BASICPAXOS(2), BASICPAXOS(3), etc., where
BASICPAXOS(k) is used to agree on the k-th value. This additional parameter will be
present in each action. For instance, the Init(v)_i and Decide(v')_i actions of process
i become Init(k, v)_i and Decide(k, v')_i in BASICPAXOS(k)_i. Similar modifications are
needed for all other actions. The STARTERALG_i automaton for process i has to be
modified in a similar way. Also, messages belonging to the k-th instance need to be
tagged with k.

This simple approach has the problem that an infinite number of instances must
be started unless we know in advance how many instances of PAXOS are needed. We
have not defined the composition of Clock GTA for an infinite number of automata
(see Chapter 2).

In the following section we follow a different approach consisting of modifying the
BASICPAXOS and STARTERALG automata of PAXOS to obtain the MULTIPAXOS algorithm.
This differs from the approach described above because we do not have separate
automata for each single instance. The MULTIPAXOS algorithm takes advantage of
the fact that, in a normal situation, there is a unique leader that runs all the instances
of PAXOS. The leader can use a single message for step 1 of all the instances. Similarly
step 2 can also be handled grouping all the instances together. Then, from step 3
on each instance must proceed separately; however step 3 is performed only when an 
initial value is provided. 

Though the approach described above is conceptually simple, it requires some 
change to the code of the automata we developed in Chapter 6. To implement MUL- 
TIPAXOS we need to modify BASICPAXOS and STARTERALG. Indeed BASICPAXOS and 
STARTERALG are designed to handle a single instance of PAXOS, while now we need to 
handle many instances all together for the first two steps of a round. In this section we 
design two automata similar to BASICPAXOS and STARTERALG that handle multiple 
instances of PAXOS. We call them MULTIBASICPAXOS and MULTISTARTERALG. 

7.2 Automaton MULTIBASICPAXOS

Automaton MULTIBASICPAXOS has, of course, the same structure as BASICPAXOS,
thus goes through the same sequence of steps of a round with the difference that
now steps 1 and 2 are executed only once and not repeated by each instance. The
remaining steps are handled separately for each instance of PAXOS.

When initiating new rounds MULTIBASICPAXOS uses the same round number for
all the instances. This allows the leader to send only one "Collect" message to all
the agents and this message serves for all the instances of PAXOS. When responding
to a "Collect" message for a round r, agents have to send information about all the
instances of PAXOS in which they are involved; for each of them they have to specify
the same information as in BASICPAXOS, i.e., the number of the last round in which
they accepted the value being proposed and the value of that round. We recall that
an agent, by responding to a "Collect" message for a round r, also commits to not
accept the value of any round with round number less than r; this commitment is
made for all the instances of PAXOS.

Once the leader has executed steps 1 and 2, it is ready to execute step 3 for every
instance for which there is an initial value. For instances for which no initial
value has been provided, the leader can proceed with step 3 as soon as an initial
value is provided.

Next, we give a description of the steps of MULTIBASICPAXOS by relating them to 
those of BASICPAXOS, so that it is possible to emphasize the differences. 

1. To initiate a round, the leader sends a message to all agents specifying the 
number r of the new round and also the set of instances for which the leader 
already knows the outcome. This message serves as "Collect" message for all 
the instances of PAXOS for which a decision has not been reached yet. This is 
an infinite set, but only for a finite number of instances is there information 
to exchange. Since agents may not be aware of the outcomes of instances for
which the leader has already reached a decision, the leader sends in the "Collect" 
message, along with the round number, also the instances of PAXOS for which 
it already knows the decision. 

2. An agent that receives a message sent in step 1 from the leader of the round, 
responds giving its own information about rounds previously conducted for all 
the instances of PAXOS for which it has information to give to the leader. This 
information is as in BASICPAXOS, that is, for each instance the agent sends 
the last round in which it accepted the proposed value and the value of that 
round. Only for a finite number of instances does the agent have information. 
The agent makes the same kind of commitment as in BASICPAXOS. That is, 
it commits, in any instance, to not accept the value of any round with round 
number less than r. An agent may have already reached a decision for instances 
for which the leader still does not know the decision. Hence the agent also 
informs the leader of any decision already made. 

3. Once the leader has gathered responses from a majority of the processes it can 
propose a value for each instance of PAXOS for which it has an initial value. As 
in BASICPAXOS, it sends a "Begin" message asking to accept that value. For 
instances for which there is no initial value, the leader does not perform this 
step. However, as soon as there is an initial value, the leader can perform this 
step. Notice that step 3 is performed separately for each instance. 

4. An agent that receives a message from the leader of the round sent in step 3 
of a particular instance, responds by accepting the proposed value if it is not 
committed for a round with a larger round number. 

5. If the leader of a round receives, for a particular instance, "Accept" messages 
from a majority of processes, then, for that particular instance, a decision is 
made. 

Once the leader has made a decision for a particular instance, it broadcasts that 
decision as in BASICPAXOS. 

It is worth noting that, since steps 1 and 2 are handled with all the instances 
grouped together, there is a unique info-quorum, whereas, since from step 3 on each 
instance proceeds separately, there is an accepting-quorum for each instance (two 
instances may have different accepting-quorums). 
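
The distinction between the single info-quorum and the per-instance accepting-quorums can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `LeaderQuorums` and its method names are hypothetical and do not appear in the automata of this chapter; only the majority test (more than n/2 processes) is taken from the text.

```python
from collections import defaultdict


class LeaderQuorums:
    """Hypothetical bookkeeping: one info-quorum, one accepting-quorum per instance."""

    def __init__(self, n):
        self.n = n                          # total number of processes
        self.info_quorum = set()            # shared by all instances (steps 1 and 2)
        self.acc_quorum = defaultdict(set)  # instance k -> agents that accepted (steps 3-5)

    def on_last(self, agent):
        """Record a "Last" reply; return True once a majority has replied."""
        self.info_quorum.add(agent)
        return len(self.info_quorum) > self.n / 2

    def on_accept(self, k, agent):
        """Record an "Accept" for instance k; return True once a majority accepted."""
        self.acc_quorum[k].add(agent)
        return len(self.acc_quorum[k]) > self.n / 2


q = LeaderQuorums(n=5)
info_fixed = [q.on_last(a) for a in (1, 2, 3)]
decided_7 = [q.on_accept(7, a) for a in (1, 2, 3)]
decided_8 = [q.on_accept(8, a) for a in (3, 4, 5)]
```

In this run instances 7 and 8 are both decided, yet their accepting-quorums {1, 2, 3} and {3, 4, 5} differ, while the info-quorum is the same for both.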

Figures 7-1, 7-2, 7-3, 7-4 and 7-5 show the code fragments of automata BMPLEADER_i, 
BMPAGENT_i and BMPSUCCESS_i for process i. Automaton MULTIBASICPAXOS_i for process i 
is obtained by composing these three automata. In addition to the code fragments, 
we provide here some comments. The first general comment is that MULTIBASICPAXOS 
is very similar to BASICPAXOS and the differences are just technicalities due 
to the fact that MULTIBASICPAXOS handles multiple instances of PAXOS all together 
for the first two steps of a round. This clearly results in more complicated code, at 
least for some parts of the automaton. We refer the reader to the description of the 
code of BASICPAXOS, and in the following we give specific comments on those parts of 
the automaton that required significant changes. We follow the same style used 
for BASICPAXOS by describing the messages used and, for each automaton, the state 
variables and the actions. 

Messages. The messages are the same as in BASICPAXOS, but their structure is 
slightly different. The following description of the messages assumes that 
process i is the leader. 

1. "Collect" messages, m = (r, "Collect", D, W)_{i,j}. This message is as the "Collect" 
message of BASICPAXOS. Moreover, it also specifies the set D of all the instances 
for which the leader already knows the decision and the set W of instances for 
which the leader has an initial value but not a decision yet. 

2. "Last" messages, m = (r, "Last", D', W', {(k, b_k, v_k) | k ∈ W'})_{j,i}. As in BASICPAXOS, 
an agent responds to a "Collect" message with a "Last" message. The message 
includes a set D' containing pairs (k, Decision(k)) for all the instances for which 
the agent knows the decision and the leader does not. The message also includes 
a set W' which contains all the instances of the set W of the "Collect" 
message plus those instances for which the agent has an initial value while the 
leader does not. Finally, for each instance k in W' the agent sends the round 
number b_k of the latest accepted round for instance k and the value v_k of round 
b_k. 

3. "Begin" messages, m = (k, r, "Begin", v)_{i,j}. This message is as in BASICPAXOS, 
with the difference that the particular instance k to which it is pertinent is 
specified. 

4. "Accept" messages, m = (k, r, "Accept")_{j,i}. This message is as in BASICPAXOS, 
with the difference that the particular instance k to which it is pertinent is 
specified. 
BMPLEADER_i 

Signature: 

  Input:     Receive(m)_{j,i}, m ∈ {"Last", "Accept", "Success", "OldRound"} 
             Init(k, v)_i, NewRound_i, Stop_i, Recover_i, Leader_i, NotLeader_i 
  Internal:  Collect_i, GatherLast_i, Continue(k)_i, GatherAccept(k)_i, GatherOldRound_i 
  Output:    Send(m)_{i,j}, m ∈ {"Collect", "Begin"} 
             BeginCast(k)_i, RndSuccess(k, v)_i 

States: 

  Status ∈ {alive, stopped}                         initially alive 
  IamLeader, a boolean                              initially false 
  Mode ∈ {rnddone, collect, gatherlast}             initially rnddone 
  Mode, array of values ∈ {wait, begincast, 
        gatheraccept, decided, rnddone}             initially all rnddone 
  InitValue, array of V ∪ {nil}                     initially all nil 
  Decision, array of V ∪ {nil}                      initially all nil 
  HighestRnd ∈ ℛ                                    initially (0,i) 
  CurRnd ∈ ℛ                                        initially (0,i) 
  RndValue, array of V ∪ {nil}                      initially all nil 
  RndVFrom, array of ℛ                              initially all (0,i) 
  RndInfQuo ∈ 2^ℐ                                   initially {} 
  RndAccQuo, array of 2^ℐ                           initially all {} 
  InMsgs, multiset of messages                      initially {} 
  OutMsgs, multiset of messages                     initially {} 

Tasks and bounds: 

  {Collect_i, GatherLast_i}, bounds [0, ℓ] 
  {Continue(k)_i, BeginCast(k)_i, GatherAccept(k)_i, RndSuccess(k, v)_i : k ∈ ℕ}, bounds [0, ℓ] 
  {GatherOldRound_i}, bounds [0, ℓ] 
  {Send(m)_i : m ∈ M}, bounds [0, ℓ] 

Actions: 

  input Stop_i 
    Eff: Status := stopped 

  input Recover_i 
    Eff: Status := alive 

  input Leader_i 
    Eff: if Status = alive then 
           IamLeader := true 

  input NotLeader_i 
    Eff: if Status = alive then 
           IamLeader := false 

  output Send(m)_{i,j} 
    Pre: Status = alive 
         m_{i,j} ∈ OutMsgs 
    Eff: remove m_{i,j} from OutMsgs 

  input Receive(m)_{j,i} 
    Eff: if Status = alive then 
           add m_{j,i} to InMsgs 

Figure 7-1: Automaton BMPLEADER for process i (part 1) 



Actions: 

  input NewRound_i 
    Eff: if Status = alive then 
           CurRnd := HighestRnd +_i 1 
           HighestRnd := CurRnd 
           Mode := collect 
           ∀k, Mode(k) := rnddone 

  internal Collect_i 
    Pre: Status = alive 
         Mode = collect 
    Eff: RndInfQuo := {} 
         D := {k | Decision(k) ≠ nil} 
         ∀k ∉ D 
           RndValue(k) := InitValue(k) 
           RndVFrom(k) := (0,i) 
           RndAccQuo(k) := {} 
         W := {k | InitValue(k) ≠ nil and Decision(k) = nil} 
         ∀j put (CurRnd, "Collect", D, W)_{i,j} in OutMsgs 
         Mode := gatherlast 

  internal GatherLast_i 
    Pre: Status = alive 
         Mode = gatherlast 
         m = (r, "Last", D', W', {(k, b_k, v_k) | k ∈ W'})_{j,i} ∈ InMsgs 
         CurRnd = r 
    Eff: remove all copies of m from InMsgs 
         ∀(k, v) ∈ D' do 
           Decision(k) := v 
           Mode(k) := decided 
         RndInfQuo := RndInfQuo ∪ {j} 
         ∀k ∈ W' 
           if RndVFrom(k) < b_k and v_k ≠ nil then 
             RndValue(k) := v_k 
             RndVFrom(k) := b_k 
         if |RndInfQuo| > n/2 then 
           Mode := begincast 
           ∀k 
             if RndValue(k) = nil and InitValue(k) ≠ nil then 
               RndValue(k) := InitValue(k) 
             if RndValue(k) ≠ nil then 
               Mode(k) := begincast 
             else 
               Mode(k) := wait 

  input Init(k, v)_i 
    Eff: if Status = alive then 
           InitValue(k) := v 

  internal Continue(k)_i 
    Pre: Status = alive 
         Mode(k) = wait 
         RndValue(k) = nil 
    Eff: RndValue(k) := InitValue(k) 
         Mode(k) := begincast 

  output BeginCast(k)_i 
    Pre: Status = alive 
         Mode(k) = begincast 
    Eff: ∀j put (k, CurRnd, "Begin", RndValue(k))_{i,j} in OutMsgs 
         Mode(k) := gatheraccept 

  internal GatherAccept(k)_i 
    Pre: Status = alive 
         Mode(k) = gatheraccept 
         m = (k, r, "Accept")_{j,i} ∈ InMsgs 
         CurRnd = r 
    Eff: remove all copies of m from InMsgs 
         RndAccQuo(k) := RndAccQuo(k) ∪ {j} 
         if |RndAccQuo(k)| > n/2 then 
           Decision(k) := RndValue(k) 
           Mode(k) := decided 

  output RndSuccess(k, Decision(k))_i 
    Pre: Status = alive 
         Mode(k) = decided 
    Eff: Mode(k) := rnddone 

  internal GatherOldRound_i 
    Pre: Status = alive 
         m = (r, "OldRound", r')_{j,i} ∈ InMsgs 
         CurRnd < r' 
    Eff: remove all copies of m from InMsgs 
         HighestRnd := r' 

Figure 7-2: Automaton BMPLEADER for process i (part 2) 
BMPAGENT_i 

Signature: 

  Input:     Receive(m)_{j,i}, m ∈ {"Collect", "Begin"} 
             Init(k, v)_i, Stop_i, Recover_i 
  Internal:  LastAccept_i, Accept(k)_i 
  Output:    Send(m)_{i,j}, m ∈ {"Last", "Accept", "OldRound"} 

States: 

  Status ∈ {alive, stopped}          initially alive 
  LastB, array of ℛ                  initially all (0,i) 
  LastV, array of V ∪ {nil}          initially all nil 
  Commit ∈ ℛ                         initially (0,i) 
  InMsgs, multiset of messages       initially {} 
  OutMsgs, multiset of messages      initially {} 

Tasks and bounds: 

  {LastAccept_i}, bounds [0, ℓ] 
  {Accept(k)_i : k ∈ ℕ}, bounds [0, ℓ] 
  {Send(m)_i : m ∈ M}, bounds [0, ℓ] 

Actions: 

  input Stop_i 
    Eff: Status := stopped 

  input Recover_i 
    Eff: Status := alive 

  output Send(m)_{i,j} 
    Pre: Status = alive 
         m_{i,j} ∈ OutMsgs 
    Eff: remove m_{i,j} from OutMsgs 

  input Receive(m)_{j,i} 
    Eff: if Status = alive then 
           add m_{j,i} to InMsgs 

  internal LastAccept_i 
    Pre: Status = alive 
         m = (r, "Collect", D, W)_{j,i} ∈ InMsgs 
    Eff: remove all copies of m from InMsgs 
         if r > Commit then 
           Commit := r 
           W'' := {k ∈ ℕ | LastV(k) ≠ nil, Decision(k) = nil} 
           W' := W'' ∪ W 
           D' := {(k, Decision(k)) | k ∉ D, Decision(k) ≠ nil} 
           put (r, "Last", D', W', {(k, b_k, v_k) | k ∈ W'})_{i,j} in OutMsgs 
             where b_k = LastB(k), v_k = LastV(k) 
         else 
           put (r, "OldRound", Commit)_{i,j} in OutMsgs 

  internal Accept(k)_i 
    Pre: Status = alive 
         m = (k, r, "Begin", v)_{j,i} ∈ InMsgs 
    Eff: remove all copies of m from InMsgs 
         if r ≥ Commit then 
           put (k, r, "Accept")_{i,j} in OutMsgs 
           LastB(k) := r, LastV(k) := v 
         else 
           put (r, "OldRound", Commit)_{i,j} in OutMsgs 

  input Init(k, v)_i 
    Eff: if Status = alive then 
           LastV(k) := v 

Figure 7-3: Automaton BMPAGENT for process i 
BMPSUCCESS_i 

Signature: 

  Input:     Receive(m)_{j,i}, m ∈ {"Ack", "Success"} 
             Stop_i, Recover_i, Leader_i, NotLeader_i, RndSuccess(k, v)_i 
  Internal:  SendSuccess_i, GatherSuccess(k)_i, GatherAck_i, Wait_i 
  Output:    Decide(k, v)_i, Send("Success", v)_{i,j} 
  Time-passage: ν(t) 

State: 

  Clock ∈ ℝ                              initially arbitrary 
  Status ∈ {alive, stopped}              initially alive 
  Decision, array of V ∪ {nil}           initially nil 
  IamLeader, a boolean                   initially false 
  Acked(j), array of boolean, ∀j ∈ ℐ     initially all false 
  PrevSend ∈ ℝ ∪ {nil}                   initially nil 
  LastSend ∈ ℝ ∪ {∞}                     initially ∞ 
  LastWait ∈ ℝ ∪ {∞}                     initially ∞ 
  LastGA(k), array of ℝ ∪ {∞}            initially ∞ 
  LastGS(k), array of ℝ ∪ {∞}            initially ∞ 
  LastSS(k), array of ℝ ∪ {∞}            initially ∞ 
  InMsgs, multiset of messages           initially {} 
  OutMsgs, multiset of messages          initially {} 

Actions: 

  input Stop_i 
    Eff: Status := stopped 

  input Recover_i 
    Eff: Status := alive 

  input Leader_i 
    Eff: if Status = alive then 
           IamLeader := true 

  input NotLeader_i 
    Eff: if Status = alive then 
           IamLeader := false 

  input RndSuccess(k, v)_i 
    Eff: if Status = alive then 
           Decision(k) := v 
           LastSS(k) := Clock + ℓ 

  input Receive(m)_{j,i} 
    Eff: if Status = alive then 
           put m_{j,i} into InMsgs 
           if m_{j,i} = (k, "Ack") and 
              LastGA(k) = ∞ then 
             LastGA(k) := Clock + ℓ 
           if m_{j,i} = (k, "Success", v) and 
              LastGS(k) = ∞ then 
             LastGS(k) := Clock + ℓ 

  output Send(m)_{i,j} 
    Pre: Status = alive 
         m_{i,j} ∈ OutMsgs 
    Eff: remove m_{i,j} from OutMsgs 
         if OutMsgs is empty then 
           LastSend := ∞ 
         else 
           LastSend := Clock + ℓ 

Figure 7-4: Automaton BMPSUCCESS for process i (part 1) 
  internal SendSuccess_i 
    Pre: Status = alive, IamLeader = true 
         PrevSend = nil 
         ∃j ≠ i, ∃k s.t. Decision(k) ≠ nil 
           and Acked(j, k) = false 
    Eff: ∀j ≠ i, ∀k s.t. Decision(k) ≠ nil 
           and Acked(j, k) = false 
             put (k, "Success", Decision(k))_{i,j} 
               in OutMsgs 
         PrevSend := Clock 
         LastSend := Clock + ℓ 
         LastWait := Clock + 5ℓ + 2nℓ + 2d 
         LastSS := ∞ 

  internal GatherSuccess(k)_i 
    Pre: Status = alive 
         m = (k, "Success", v)_{j,i} ∈ InMsgs 
    Eff: remove all copies of m from InMsgs 
         Decision(k) := v 
         put (k, "Ack")_{i,j} in OutMsgs 

  output Decide(k, v)_i 
    Pre: Status = alive 
         Decision(k) ≠ nil 
         Decision(k) = v 
    Eff: none 

  internal GatherAck_i 
    Pre: Status = alive 
         m = (k, "Ack")_{j,i} ∈ InMsgs 
    Eff: remove all copies of m from InMsgs 
         Acked(j, k) := true 
         if no other (k, "Ack") is in InMsgs then 
           LastGA(k) := ∞ 
         else 
           LastGA(k) := Clock + ℓ 

  internal Wait_i 
    Pre: Status = alive, PrevSend ≠ nil 
         Clock > PrevSend + (4ℓ + 2nℓ + 2d) 
    Eff: PrevSend := nil 
         LastWait := ∞ 

  time-passage ν(t) 
    Pre: Status = alive 
    Eff: Let t' be s.t. 
           Clock + t' < LastSend 
           Clock + t' < LastWait 
           and for all k 
             Clock + t' < LastSS(k) 
             Clock + t' < LastGS(k) 
             Clock + t' < LastGA(k) 
         Clock := Clock + t' 

Figure 7-5: Automaton BMPSUCCESS for process i (part 2) 
5. "OldRound" messages, m = (r, "OldRound", r')_{j,i}. This message is as in BASICPAXOS. 
Notice that there is no need to specify any instance since when a new round is 
started, it is started for all the instances. 

6. "Success" messages, m = (k, "Success", v)_{i,j}. This message is as in BASICPAXOS, 
with the difference that the particular instance k to which it is pertinent is 
specified. 

7. "Ack" messages, m = (k, "Ack")_{j,i}. This message is as in BASICPAXOS, with the 
difference that the particular instance k to which it is pertinent is specified. 
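
The seven message formats can be summarized as plain record types. The following Python sketch is an illustration only: the class and field names are hypothetical, and it assumes that a round number is a pair (counter, process identifier) compared lexicographically, which matches the initial value (0, i) used in the figures but is not spelled out in this section.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet, Tuple

Round = Tuple[int, int]  # assumed encoding: (counter, process id)


@dataclass(frozen=True)
class Collect:
    r: Round
    D: FrozenSet[int]  # instances whose decision the leader already knows
    W: FrozenSet[int]  # instances with an initial value but no decision yet


@dataclass(frozen=True)
class Last:
    r: Round
    D_prime: Tuple[Tuple[int, Any], ...]      # (k, Decision(k)) unknown to the leader
    W_prime: FrozenSet[int]                   # W plus the agent's own pending instances
    info: Tuple[Tuple[int, Round, Any], ...]  # (k, b_k, v_k) for each k in W'


@dataclass(frozen=True)
class Begin:
    k: int
    r: Round
    v: Any


@dataclass(frozen=True)
class Accept:
    k: int
    r: Round


@dataclass(frozen=True)
class OldRound:
    r: Round
    r_prime: Round  # no instance field: a round is started for all instances


@dataclass(frozen=True)
class Success:
    k: int
    v: Any


@dataclass(frozen=True)
class Ack:
    k: int


collect = Collect(r=(3, 0), D=frozenset({1}), W=frozenset({2, 4}))
begin = Begin(k=2, r=(3, 0), v="x")
old = OldRound(r=(3, 0), r_prime=(5, 1))
```

Only "Collect", "Last" and "OldRound" lack an instance field, which reflects the fact that steps 1 and 2 and round restarts concern all instances at once.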

Most state variables and automaton actions are similar to the corresponding state 
variables and automaton actions of BASICPAXOS. We will only describe those state 
variables and automaton actions that required significant changes. For variables we 
need to use arrays indexed by the instance number. Most of the actions are as in 
BASICPAXOS, with the difference that a parameter k specifying the instance is present. 
This is true especially for actions relative to steps 3, 4, and 5 and for BMPSUCCESS_i. 
Actions relative to steps 1 and 2 needed major rewriting since in MULTIBASICPAXOS 
they handle multiple instances of PAXOS all together. 

Automaton BMPLEADER_i. Variables InitValue, Decision, RndValue, RndVFrom 
and RndAccQuo are now arrays of variables indexed by the instance number: we 
need the information stored in these variables for each instance. Variables HighestRnd, 
CurRnd and RndInfQuo are not arrays because there is always one current round 
number and only one info-quorum (used for all the instances). Variable Mode deserves 
some more comments: in BMPLEADER_i we have a scalar variable Mode which is 
used for the first two steps; then, since from the third step on each instance is run 
separately, we have another variable Mode which is an array. Notice that values 
collect and gatherlast of variable Mode are relative to the first two steps of a 
round and that values wait, begincast, gatheraccept and decided are relative to 
the other steps of a round. Value rnddone is used both when no round has been 
started yet and when a round has been completed. 
Action Collect_i first computes the set D of PAXOS instances for which a decision 
is already known. It then initializes the state variables pertinent to all the potential 
instances of PAXOS, which are all the ones not included in D. Notice that even though 
this is potentially an infinite set, we need to initialize those variables only for a finite 
number of instances. Then it computes the set W of instances of PAXOS for which the 
leader has an initial value but not yet a decision. Finally a "Collect" message is sent 
to all the agents. Action GatherLast_i takes care of the receipt of the responses to the 
"Collect" message. It processes "Last" messages by updating, as BASICPAXOS does, 
the state variables pertinent to all the instances for which information is contained in 
the "Last" message. Also, if the agent is informing the leader of a decision of which 
the leader is not aware, then the leader immediately sets its Decision variable. When 
a "Last" message has been received from a majority of the processes, the info-quorum is 
fixed. At this point, each instance for which there is an initial value can go on with 
step 3 of the round. Action Continue(k)_i takes care of those instances for which, after 
the info-quorum is fixed by the GatherLast_i action, there is no initial value. As soon 
as there is an initial value, these instances can also proceed with step 3. Other actions 
are similar to the corresponding actions in BPLEADER_i. 
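
The processing of a "Last" message described above can be sketched as follows. This is a minimal Python illustration, not the automaton itself: the dictionary-based leader state, the function names and the finite-dictionary encoding of the arrays are assumptions of the sketch, and already decided instances are simply skipped once the info-quorum is fixed.

```python
def new_leader_state(me, cur_rnd, init):
    """Hypothetical leader state right after Collect has been executed."""
    return {"me": me, "mode": "gatherlast", "cur_rnd": cur_rnd, "info_quo": set(),
            "init": dict(init), "decision": {}, "value": {}, "vfrom": {}, "mode_k": {}}


def gather_last(state, sender, r, d_prime, w_prime, info, n):
    """Process one "Last" message (r, D', W', info) received from `sender`."""
    if state["mode"] != "gatherlast" or state["cur_rnd"] != r:
        return
    for k, v in d_prime:                  # decisions the leader was not aware of
        state["decision"][k] = v
        state["mode_k"][k] = "decided"
    state["info_quo"].add(sender)
    for k in w_prime:
        b_k, v_k = info[k]
        # keep the value of the highest previously accepted round reported so far
        if v_k is not None and state["vfrom"].get(k, (0, state["me"])) < b_k:
            state["value"][k] = v_k
            state["vfrom"][k] = b_k
    if len(state["info_quo"]) > n / 2:    # the info-quorum is fixed
        state["mode"] = "begincast"
        for k in set(state["init"]) | set(state["value"]):
            if state["decision"].get(k) is not None:
                continue
            if state["value"].get(k) is None:
                state["value"][k] = state["init"].get(k)
            has_value = state["value"][k] is not None
            state["mode_k"][k] = "begincast" if has_value else "wait"


state = new_leader_state(me=0, cur_rnd=(3, 0), init={1: "a", 2: None})
gather_last(state, 1, (3, 0), [], {1}, {1: ((0, 1), None)}, n=3)
mode_after_first = state["mode"]
gather_last(state, 2, (3, 0), [], {1, 3},
            {1: ((0, 2), None), 3: ((2, 1), "z")}, n=3)
```

After the second reply a majority of the three processes has answered: instance 1 proceeds with the leader's own initial value, instance 3 adopts the value "z" reported by an agent, and instance 2 waits for an initial value.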

Automaton BMPAGENT_i. Variables LastB and LastV are now arrays of variables 
indexed by the instance number, while variable Commit is a scalar variable; indeed 
there is always only one round number used for all the instances. 

Action LastAccept_i responds to the "Collect" message. If the agent is not committed 
for the round number specified in the "Collect" message, it commits for that round 
and sends the following information to the leader. First, the set D' of PAXOS 
instances for which the agent knows the decision while the leader does not, together 
with the decision for each such instance. Second, for each instance in the set W of 
the "Collect" message, and also for each instance for which the agent has an initial 
value while the leader does not, the usual information: the last round in which the 
process accepted the value of the round and the value of that round. Actions 
Accept(k)_i and Init(k, v)_i are similar to the corresponding actions in BPAGENT_i. 
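
The construction of the reply to a "Collect" message can be illustrated with a short Python sketch. The function `last_accept`, the dictionary layout of the agent state and the tuple encoding of the replies are hypothetical choices made for this illustration; round numbers are assumed to be pairs compared lexicographically.

```python
def last_accept(agent, r, D, W):
    """Build the agent's reply to the message (r, "Collect", D, W)."""
    if not r > agent["commit"]:
        return ("OldRound", r, agent["commit"])
    agent["commit"] = r  # a single commitment covers every instance
    # instances for which the agent holds a value but knows no decision
    own = {k for k, v in agent["last_v"].items()
           if v is not None and agent["decision"].get(k) is None}
    w_prime = own | set(W)
    # decisions known to the agent but not listed by the leader
    d_prime = {(k, v) for k, v in agent["decision"].items()
               if k not in D and v is not None}
    info = {k: (agent["last_b"].get(k, (0, agent["id"])), agent["last_v"].get(k))
            for k in w_prime}
    return ("Last", r, d_prime, w_prime, info)


agent = {"id": 2, "commit": (0, 2), "last_b": {4: (1, 1)},
         "last_v": {4: "x"}, "decision": {1: "d"}}
reply = last_accept(agent, (2, 0), D=set(), W={5})
stale = last_accept(agent, (1, 0), D=set(), W={5})
```

The first call commits the agent to round (2, 0) and reports both the decision of instance 1 and the accepted value of instance 4; the second call, for the smaller round (1, 0), is answered with an "OldRound" reply carrying the current commitment.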
Automaton BMPSUCCESS_i. This automaton is very similar to BPSUCCESS_i. The 
only difference is that now the leader sends a "Success" message for any instance for 
which there is a decision and there are agents that have not sent an acknowledgment. 

7.3 Automaton MULTISTARTERALG 

As with BASICPAXOS, for MULTIBASICPAXOS we need an automaton that takes 
care of starting new rounds when necessary, i.e., when a decision is not reached 
within some time bound. We call this automaton MULTISTARTERALG. The task 
of MULTISTARTERALG is the same as that of STARTERALG: it has to check that 
rounds are successful within the expected time bound. This time bound checking 
must be done separately for each instance. 

Figure 7-6 shows automaton MULTISTARTERALG_i for process i. The automaton is 
similar to automaton STARTERALG_i. The difference is that the time bound checking 
is done separately for each instance. A new round is started if there is an instance 
for which a decision is not reached within the expected time bound. 
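
The per-instance time bound check can be sketched as below. The function name `late_instances` and the dictionary encoding of the per-instance deadlines and success flags are assumptions of this sketch; the deadlines themselves are taken as given inputs rather than computed.

```python
def late_instances(clock, deadline, rnd_success):
    """Return the instances whose round missed its deadline without succeeding."""
    return sorted(k for k, t in deadline.items()
                  if t is not None and clock > t and not rnd_success.get(k, False))


deadline = {1: 10.0, 2: 10.0, 3: None, 4: 20.0}  # instance -> deadline (None: no round pending)
rnd_success = {1: True}                          # instance 1 already succeeded
late = late_instances(12.0, deadline, rnd_success)
start_new_round = bool(late)                     # one late instance is enough
```

Here only instance 2 is late at time 12, and that alone causes a new round to be requested; since a round is started for all the instances, no per-instance restart is needed.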

7.4 Correctness and analysis 

We do not formally prove the correctness of the code provided in this section. However, 
correctness follows from the correctness of PAXOS. Indeed, for every instance of 
PAXOS, the code of MULTIPAXOS provided in this section does exactly the same thing 
that PAXOS does; the only difference is that step 1 (as well as step 2) is handled in a 
single shot for all the instances. It follows that Theorem 6.4.2 can be restated for each 
instance k of PAXOS. In the following theorem we consider a nice execution fragment 
$\alpha$ and we assume that i is eventually elected leader (by Lemma 5.2.2 this happens by 
time $4\ell + 2d$ in $\alpha$). 

In the following theorem $t^i_\alpha(k)$ denotes $t^i_\alpha$ for instance k. The formal definition of 
$t^i_\alpha(k)$ is obtained from the definition of $t^i_\alpha$ (see Definition 6.2.22) by changing 
Init(v)_i to Init(k, v)_i. 
MULTISTARTERALG_i 

Signature: 

  Input:     Leader_i, NotLeader_i, BeginCast(k)_i, RndSuccess(k, v)_i, Stop_i, Recover_i 
  Internal:  CheckRndSuccess(k)_i 
  Output:    NewRound_i 
  Time-passage: ν(t) 

States: 

  Clock ∈ ℝ                           initially arbitrary 
  Status ∈ {alive, stopped}           initially alive 
  IamLeader, a boolean                initially false 
  Start, a boolean                    initially false 
  RegRnds ⊆ ℕ                         initially empty 
  Deadline, array of ℝ ∪ {nil}        initially nil 
  LastNR ∈ ℝ ∪ {∞}                    initially ∞ 
  Last, array of ℝ ∪ {∞}              initially ∞ 
  RndSuccess, array of boolean        initially false 

Actions: 

  input Stop_i 
    Eff: Status := stopped 

  input Recover_i 
    Eff: Status := alive 

  input Leader_i 
    Eff: if Status = alive then 
           if IamLeader = false then 
             IamLeader := true 
             For all k 
               if RndSuccess(k) = false then 
                 Deadline(k) := nil 
                 Start := true 
                 LastNR := Clock + ℓ 

  input NotLeader_i 
    Eff: if Status = alive then 
           IamLeader := false 

  input BeginCast(k)_i 
    Eff: if Status = alive then 
           Deadline(k) := Clock + 4ℓ + 2nℓ + 2d 

  input RndSuccess(k, v)_i 
    Eff: if Status = alive then 
           RndSuccess(k) := true 
           Last(k) := ∞ 

  output NewRound_i 
    Pre: Status = alive 
         IamLeader = true 
         Start = true 
    Eff: Start := false 
         LastNR := ∞ 

  internal CheckRndSuccess(k)_i 
    Pre: Status = alive 
         IamLeader = true 
         Deadline(k) ≠ nil 
         Clock > Deadline(k) 
    Eff: Last(k) := ∞ 
         if RndSuccess(k) = false then 
           Start := true 
           LastNR := Clock + ℓ 

  time-passage ν(t) 
    Pre: Status = alive 
    Eff: Let t' be s.t. 
           ∀k, Clock + t' < Last(k) 
           and Clock + t' < LastNR 
         Clock := Clock + t' 

Figure 7-6: Automaton MULTISTARTERALG for process i 
Theorem 7.4.1 Let $\alpha$ be a nice execution fragment of $S_{MPX}$ starting in a reachable 
state and lasting for more than $t^i_\alpha(k) + 24\ell + 10n\ell + 13d$. Then the leader i executes 
Decide(k, v')_i by time $t^i_\alpha(k) + 21\ell + 8n\ell + 11d$ from the beginning of $\alpha$ and at most 8n 
messages are sent. Moreover by time $t^i_\alpha(k) + 24\ell + 10n\ell + 13d$ from the beginning of 
$\alpha$ any alive process j executes Decide(k, v')_j and at most 2n additional messages are 
sent. 
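
As a purely numerical illustration of the two bounds in Theorem 7.4.1, the following Python sketch evaluates them for sample parameters. The function names and the sample values of ℓ, n and d are assumptions chosen for the example; they do not come from the thesis.

```python
def leader_decide_bound(t_init, ell, n, d):
    """Time by which the leader executes Decide(k, v') according to Theorem 7.4.1."""
    return t_init + 21 * ell + 8 * n * ell + 11 * d


def all_decide_bound(t_init, ell, n, d):
    """Time by which every alive process executes Decide(k, v')."""
    return t_init + 24 * ell + 10 * n * ell + 13 * d


# Sample parameters (hypothetical): step bound 1, five processes, message delay 10.
leader_time = leader_decide_bound(t_init=0, ell=1, n=5, d=10)
all_time = all_decide_bound(t_init=0, ell=1, n=5, d=10)
extra = all_time - leader_time  # equals 3*ell + 2*n*ell + 2*d
```

With these values the message delay d dominates both bounds, and the additional time needed for the other processes to learn the decision is $3\ell + 2n\ell + 2d$.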

7.5 Concluding remarks 

In this chapter we have described the MULTIPAXOS protocol. MULTIPAXOS is a 
variation of the PAXOS algorithm. It was discovered by Lamport at the same time as 
PAXOS [29]. 

MULTIPAXOS achieves consensus on a sequence of values by using an instance of 
PAXOS to agree on each value of the sequence; the remarks about PAXOS provided at 
the end of Chapter 6 therefore apply also to MULTIPAXOS. We refer the reader to 
those remarks. 
Chapter 8 



Application to data replication 



In this chapter we show how to use MULTIPAXOS to implement a data replication 
algorithm. 

8.1 Overview 

Providing distributed and concurrent access to data objects is an important issue in 
distributed computing. The simplest implementation maintains the object at a single 
process which is accessed by multiple clients. However this approach does not scale 
well as the number of clients increases and it is not fault-tolerant. Data replication 
allows faster access and provides fault tolerance by replicating the data object at 
several processes. 

One of the best known replication techniques is majority voting (e.g., [20, 23]). 
With this technique both update (write) and non-update (read) operations are per- 
formed at a majority of the processes of the distributed system. This scheme can 
be extended to consider any "write quorum" for an update operation and any "read 
quorum" for a non-update operation. Write quorums and read quorums are just sets 
of processes satisfying the property that any two quorums, one of which is a write 
quorum and the other one is a read quorum, intersect (e.g., [16]). A simple quorum 
scheme is the write-all/read-one scheme (e.g., [6]) which gives fast access for 
non-update operations. 
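
The intersection property that defines read and write quorums can be checked mechanically for small systems. The Python sketch below is an illustration with hypothetical helper names; it enumerates the majorities of a process set and tests that every read quorum intersects every write quorum.

```python
from itertools import combinations


def majorities(procs):
    """All subsets of `procs` containing more than half of the processes."""
    smallest = len(procs) // 2 + 1
    return [set(c) for size in range(smallest, len(procs) + 1)
            for c in combinations(procs, size)]


def is_quorum_system(read_quorums, write_quorums):
    """True if every read quorum intersects every write quorum."""
    return all(r & w for r in read_quorums for w in write_quorums)


procs = [1, 2, 3]
maj = majorities(procs)
majority_ok = is_quorum_system(maj, maj)
# write-all/read-one: a single write quorum (everyone), singleton read quorums
write_all_read_one_ok = is_quorum_system([{p} for p in procs], [set(procs)])
broken = is_quorum_system([{1}], [{2, 3}])
```

Both majority voting and write-all/read-one satisfy the property, while the last pair of quorums does not, since {1} and {2, 3} are disjoint.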
Another well-known replication technique relies on a primary copy. A distinguished 
process is considered the primary copy and it coordinates the computation: 
the clients request operations of the primary copy and the primary copy decides 
which other copies must be involved in performing the operation. The primary copy 
technique works better in practice if the primary copy does not fail. Complex recov- 
ery mechanisms are needed when the primary copy crashes. Various data replication 
algorithms based on the primary copy technique have been devised (e.g., [13, 14, 34]). 

Replication of the data object raises the issue of consistency among the replicas. 
These consistency issues depend on what requirements the replicated data has to 
satisfy. The strongest possible of such requirements is atomicity: clients accessing 
the replicated object obtain results as if there was a unique copy. Primary copy algo- 
rithms [1, 34] and voting algorithms [20, 23] are used to achieve atomicity. Achieving 
atomicity is expensive; therefore weaker consistency requirements are also considered. 
One of these weaker consistency requirements is sequential consistency [26], which al- 
lows operations to be re-ordered as long as they remain consistent with the view of 
individual clients. 

8.2 Sequential consistency 

In this section we formally define a sequentially consistent read/update object. 
Sequential consistency was first defined by Lamport [26]. We base our definition on the 
one given in [15] which relies on the notion of atomic object [27, 28] (see also [35] for 
a description of an atomic object). 

Formally a read/update shared object is defined by the set $\mathcal{O}$ of the possible 
states that the object can assume, a distinguished initial state $O_0$, and a set $U$ of 
update operations, which are functions $up : \mathcal{O} \to \mathcal{O}$. 

We assume that for each process i of the distributed system implementing the 
read/update shared object, there is a client i and that client i interacts only with 
process i. The interface between the object and the clients consists of request actions 
and report actions. In particular the client i requests a read by executing action 
Request-read_i and receives a report to the read request when Report-read(O)_i is 
executed; similarly, client i requests an update operation by executing action 
Request-update(up)_i and receives the report when action Report-update_i is executed. 

If $\beta$ is a sequence of actions, we denote by $\beta|i$ the subsequence of $\beta$ consisting 
of Request-read_i, Report-read(O)_i, Request-update(up)_i and Report-update_i. This 
subsequence represents the interactions between client i and the read/update shared 
object. 

We will only consider client-well-formed sequences of actions $\beta$, that is, sequences 
for which $\beta|i$, for every client i, does not contain two request events without an 
intervening report; in other words, we assume that a client does not request a new 
operation before receiving the report of the previous request. A sequence of actions 
$\beta$ is complete if for every request event there is a corresponding report event. If 
$\beta$ is a complete client-well-formed sequence of actions, we define the totally-precedes 
partial order on the operations that occur in $\beta$ as follows: an operation $o_1$ 
totally-precedes an operation $o_2$ if the report event of operation $o_1$ occurs before 
the request event of operation $o_2$. 

In an atomic object, the operations appear "as if" they happened in some sequential 
order. The idea of "atomic object" originated in [27, 28]. Here we use the formal 
definition given in Chapter 13 of [35]. In a sequentially consistent object the above 
atomic requirement is weakened by allowing events to be reordered as long as the view 
of each client i does not change. Formally a sequence $\beta$ of request/report actions is 
sequentially consistent if there exists an atomic sequence $\gamma$ such that $\gamma|i = \beta|i$, for 
each client i. That is, a sequentially consistent sequence "looks like" an atomic 
sequence to each individual client, even though the sequence may not be atomic. A 
read/update shared object is sequentially consistent if all the possible sequences of 
request/report actions are sequentially consistent. 
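
For very small histories, the definition above can be tested by brute force: search for a single sequential order of all operations that preserves the order of each client and in which every read returns the current state. The Python sketch below is an illustration under simplifying assumptions: each operation is modelled as one atomic step (a request immediately followed by its report), object states are integers, and the function name and history encoding are hypothetical.

```python
def sequentially_consistent(histories, initial):
    """histories: one list per client of ("update", fn) or ("read", seen) operations."""
    def search(pos, state):
        if all(p == len(h) for p, h in zip(pos, histories)):
            return True  # every operation has been placed in the sequential order
        for c, h in enumerate(histories):
            if pos[c] == len(h):
                continue
            kind, arg = h[pos[c]]
            if kind == "read":
                if arg != state:
                    continue  # this read cannot be scheduled at this point
                nxt = state
            else:
                nxt = arg(state)
            if search(pos[:c] + (pos[c] + 1,) + pos[c + 1:], nxt):
                return True
        return False

    return search(tuple(0 for _ in histories), initial)


def inc(s):
    return s + 1


# Client 1 may still read the old state 0 first: its reads can be ordered before the update.
ok = sequentially_consistent(
    [[("update", inc), ("read", 1)], [("read", 0), ("read", 1)]], initial=0)
# No order explains a client that sees 1 and then 0 when the only update increments.
bad = sequentially_consistent(
    [[("update", inc)], [("read", 1), ("read", 0)]], initial=0)
```

The search is exponential in the number of operations, so it is meant only to make the definition concrete, not to serve as a practical checker.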

8.3 Using MULTIPAXOS 

In this section we will see how to use MULTIPAXOS to design a data replication 
algorithm that guarantees sequential consistency and provides the same fault tolerance 
properties as MULTIPAXOS. The resulting algorithm lies between the two replication 
techniques discussed at the beginning of the chapter. It is similar to voting schemes 
since it uses majorities to achieve consistency, and it is similar to primary copy 
techniques since a unique leader is required to achieve termination. Using MULTIPAXOS 
gives much flexibility. For instance, it is not a disaster when there are two or more 
"primary" copies. This can only slow down the computation, but never results in 
inconsistencies. The high fault tolerance of MULTIPAXOS results in a highly fault 
tolerant data replication algorithm, i.e., process stop and recovery, loss, duplication 
and reordering of messages, and timing failures are tolerated. However liveness is not 
guaranteed: it is possible that a requested operation is never installed. 

We can use MULTIPAXOS in the following way. Each process in the system main- 
tains a copy of the data object. When client i requests an update operation, process 
i proposes that operation in an instance of MULTIPAXOS. When an update operation 
is the output value of an instance of MULTIPAXOS and the previous update has been 
applied, a process updates its local copy and the process that received the request 
for the update gives back a report to its client. A read request can be immediately 
satisfied returning the current state of the local copy. 

It is clear that the use of MULTIPAXOS gives consistency across the whole sequence 
$up_1, up_2, up_3, \ldots$ of update operations, since each operation is agreed upon by all the 
processes. In order for a process to be able to apply operation $up_k$, the process must 
first apply operation $up_{k-1}$. Hence it is necessary that there be no gaps in the sequence 
of update operations. A gap is an integer k for which processes never reach a decision 
on the k-th update (this is possible if no process proposes an update operation as 
the k-th one). Though making sure that the sequence of update operations does not 
contain a gap enables the processes to always apply new operations, it is possible to 
have a kind of "starvation" in which a requested update operation never gets satisfied 
because other updates are requested and satisfied. We will discuss this in more detail 
later. 
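
The rule that the k-th update can be applied only after the (k-1)-th one, and the effect of a gap, can be illustrated with the following Python sketch. The class `Replica` and its methods are hypothetical names introduced for this example; decided values are modelled directly as functions on the object state, without the client identifier.

```python
class Replica:
    """Hypothetical local copy that applies decided updates strictly in order."""

    def __init__(self, initial):
        self.state = initial
        self.applied = 0    # number of updates applied so far
        self.decision = {}  # instance k -> decided update (a function on states)

    def on_decide(self, k, update):
        """Record the outcome of instance k and apply whatever is now applicable."""
        self.decision[k] = update
        return self.apply_ready()

    def apply_ready(self):
        done = []
        while self.applied + 1 in self.decision:  # stop at the first gap
            self.applied += 1
            self.state = self.decision[self.applied](self.state)
            done.append(self.applied)
        return done

    def read(self):
        return self.state  # reads are served from the local copy


replica = Replica(0)
first = replica.on_decide(2, lambda s: s * 10)  # instance 1 still undecided: a gap
state_with_gap = replica.read()
second = replica.on_decide(1, lambda s: s + 1)  # gap filled, both updates apply
```

While instance 1 is undecided the decision of instance 2 cannot be applied and reads still return the initial state; once the gap is filled both updates are applied in instance order.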
8.3.1 The code 

Figures 8-1 and 8-2 show the code of automaton DATAREPLICATION_i for process i. 
This automaton implements a data replication algorithm using MULTIPAXOS as a 
subroutine. It accepts requests from a client. Read requests are immediately satisfied 
by returning the current state of the local copy of the object. Update requests 
need to be agreed upon by all the processes; thus an update operation is proposed 
in instances of PAXOS until the operation is the outcome of one of them. 
When the requested operation is the outcome of a particular instance k 
of MULTIPAXOS and the (k-1)-th update operation has been applied to the object, 
the k-th update operation can be applied to the object and a report can be given 
back to the client that requested the update operation. 

Figure 8-3 shows the interactions between the DATAREPLICATION automaton and 
MULTIPAXOS and also the interactions between the DATAREPLICATION automaton 
and the clients. 

To distinguish operations requested by different clients we pair each operation up 
with the identifier of the client requesting the update operation. Thus the set V of 
possible initial values for the instances of PAXOS is the set of pairs (up, i), where up 
is an operation on the object O and i ∈ ℐ is a process identifier. 

Next we provide some comments about the code of automaton DATAREPLICATION_i. 

Automaton actions. Actions Request-update(up)_i, Request-read_i, Report-update_i 
and Report-read(O)_i constitute the interface to the client. A client requests an 
update operation up by executing action Request-update(up)_i and gets back the result 
r when action Report-update(r)_i is executed by the DATAREPLICATION_i automaton. 
Similarly a client requests a read operation by executing action Request-read_i and 
gets back the status of the object O when action Report-read(O)_i is executed by the 
DATAREPLICATION_i automaton. 

A read request is satisfied by simply returning the status of the local copy of the 
object. Action Request-read_i sets the variable CurRead to the current status O of 
the local copy and action Report-read(O)_i reports this status to the client. 
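
As an informal illustration of the read path, the following Python sketch mirrors the two read actions. The class name LocalCopy, the method names and the NIL sentinel are assumptions introduced for this example; the automaton itself is given in Figures 8-1 and 8-2.

```python
NIL = object()  # sentinel standing in for the automaton's "nil" (assumption)


class LocalCopy:
    """Read path of one replica: the pair X = (O, k) and the variable CurRead."""

    def __init__(self, initial_state):
        # (object state, index of the last update applied to it)
        self.x = (initial_state, 0)
        self.cur_read = NIL

    def request_read(self):
        """Effect of Request-read_i: record the current local object state."""
        self.cur_read = self.x[0]

    def report_read(self):
        """Report-read(O)_i: enabled only when a read is pending.

        Returns the recorded state and resets CurRead to nil.
        """
        if self.cur_read is NIL:
            raise RuntimeError("Report-read is not enabled: no pending read")
        state, self.cur_read = self.cur_read, NIL
        return state
```

Note that no communication is involved: the read is answered from the local copy alone, whatever updates may still be pending.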
DATAREPLICATION_i 

Signature: 

    Input:     Receive(m)_{j,i}, Decide(k, v)_i, Request-update(up)_i, Request-read_i 
    Internal:  SendWantPaxos_i, RecWantPaxos_i, Update_i, RePropose(k)_i 
    Output:    Send(m)_{i,j}, Init(k, v)_i, Report-update_i, Report-read(O)_i 

States: 

    Propose, array of V ∪ {nil},           initially nil everywhere 
    Decision, array of V ∪ {nil},          initially nil everywhere 
    S, an integer,                         initially 1 
    X, a pair (O, k) with O ∈ 𝒪, k ∈ N,    initially (O_0, 0) 
    CurRead ∈ 𝒪 ∪ {nil},                   initially nil 
    Proposed, array of booleans,           initially false everywhere 
    Reproposed, array of booleans,         initially false everywhere 
    InMsgs, multiset of messages,          initially {} 
    OutMsgs, multiset of messages,         initially {} 

Tasks and bounds: 

    {Init_i}, bounds [0, ℓ] 
    {RecWantPaxos_i}, bounds [0, ℓ] 
    {SendWantPaxos_i}, bounds [0, ℓ] 
    {Report-update_i, Update_i}, bounds [0, ℓ] 
    {Report-read(O)_i}, bounds [0, ℓ] 
    {RePropose(k)_i}, bounds [0, ℓ] 
    {Send(m)_{i,j} : m ∈ M}, bounds [0, ℓ] 

Figure 8-1: Automaton DATAREPLICATION for process i (part 1) 
Actions: 

    output Send(m)_{i,j} 
        Pre: m_{i,j} ∈ OutMsgs 
        Eff: remove m_{i,j} from OutMsgs 

    input Receive(m)_{j,i} 
        Eff: add m to InMsgs 

    input Request-read_i 
        Eff: CurRead := O, where X = (O, k) 

    output Report-read(O)_i 
        Pre: CurRead = O 
        Eff: CurRead := nil 

    input Request-update(up)_i 
        Eff: Propose(S) := (up, i) 
             S := S + 1 

    output Init(k, (up, j))_i 
        Pre: Propose(k) = (up, j) 
             Proposed(k) = false 
             Decision(k) = nil 
        Eff: Proposed(k) := true 

    internal SendWantPaxos_i 
        Pre: Propose(k) = (up, i) 
             Decision(k) = nil 
        Eff: ∀j put ("WantPaxos", k, (up, i))_{i,j} in OutMsgs 

    internal RecWantPaxos_i 
        Pre: m = ("WantPaxos", k, (up, j)) in InMsgs 
        Eff: remove m from InMsgs 
             if Propose(k) = nil then 
                 Propose(k) := (up, j) 
                 S := k + 1 
                 ∀k' < S s.t. Propose(k') = nil do 
                     Propose(k') := dummy 

    output Report-update_i 
        Pre: Decision(k) = (up, i) 
             Propose(k) = (up, i) 
             X = (O, k-1) 
        Eff: X := (up(O), k) 

    internal Update_i 
        Pre: Decision(k) = (up, j) 
             j ≠ i 
             X = (O, k-1) 
        Eff: X := (up(O), k) 

    internal RePropose(k)_i 
        Pre: Propose(k) = (up, i) 
             Decision(k) ≠ (up, i) 
             Decision(k) ≠ nil 
             Reproposed(k) = false 
        Eff: Reproposed(k) := true 
             Propose(S) := (up, i) 
             S := S + 1 

    input Decide(k, (up, j))_i 
        Eff: Decision(k) := (up, j) 

Figure 8-2: Automaton DATAREPLICATION for process i (part 2) 
To satisfy an update request the requested operation must be agreed upon by 
all processes. Hence it has to be proposed in instances of MULTIPAXOS until it is 
the outcome of an instance. A Request-update(up)_i action has the effect of setting 
Propose(k), where k = S, to (up, i); action Init(k, (up, j))_i (see footnote 1) is then 
executed so that process i has (up, j) as initial value in the k-th instance of PAXOS. 
However, since process i may not be the leader, it has to broadcast a message to request 
the leader to run the k-th instance (the leader may be waiting for an initial value for 
the k-th instance). Action SendWantPaxos_i takes care of this by broadcasting a "WantPaxos" 
message specifying the instance k and also the proposed operation (up, i), so that any 
process that receives this message (and thus also the leader) and has its Propose(k) 
value still undefined will set it to (up, i). Action RecWantPaxos_i takes care of the 
receipt of "WantPaxos" messages. Notice that whenever the receipt of a "WantPaxos" 
message results in setting Propose(k) to the operation specified in the message, 
possible gaps in the sequence of proposed operations are filled with a dummy operation 
which has absolutely no effect on the object O. This avoids gaps in the sequence of 
update operations. 
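
The bookkeeping on Propose and S described above can be sketched in Python as follows. This is an informal illustration, not the automaton itself: the class ProposeTable, its method names and the DUMMY placeholder are assumptions, and message transport is left out entirely.

```python
DUMMY = ("dummy", None)  # hypothetical placeholder for the no-effect operation


class ProposeTable:
    """Per-process proposal table.

    propose[k] is the value proposed for instance k; s is the index of the
    first undefined entry, so entries 1 .. s-1 are always defined.
    """

    def __init__(self):
        self.propose = {}  # instance index -> (operation, client id)
        self.s = 1

    def request_update(self, up, i):
        """Effect of Request-update(up)_i: propose (up, i) in slot S."""
        k = self.s
        self.propose[k] = (up, i)
        self.s += 1
        return k  # the instance announced in the "WantPaxos" broadcast

    def rec_want_paxos(self, k, value):
        """Effect of RecWantPaxos_i on a message ("WantPaxos", k, value)."""
        if k in self.propose:
            return  # Propose(k) is already defined: nothing changes
        self.propose[k] = value
        self.s = k + 1
        for gap in range(1, self.s):
            # Every still-undefined slot below S is filled with the dummy.
            self.propose.setdefault(gap, DUMMY)
```

For example, a process that has proposed only in instance 1 and then receives a "WantPaxos" message for instance 4 ends up with dummy operations in slots 2 and 3 and with S = 5.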

When the k-th instance of PAXOS reaches consensus on a particular update 
operation (up, i), the update can be applied to the object (given that the (k-1)-th 
update operation has been applied to the object) and the result of the update can be 
given back to the client that requested the update operation. This is done by action 
Report-update(r)_i. Action Update_i only updates the local copy without reporting 
anything to the client if the operation was not requested by client i. If process i 
proposed an operation up as the k-th one and another operation is installed as the k-th 
one, then process i has to re-propose operation up in another instance of PAXOS. This 
is done in action RePropose_i. Notice that process i has to re-propose only operations 
that it proposed, i.e., operations of the form (up, i). 
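
The case analysis above can be summarized by a small Python sketch. The function names apply_decision and must_repropose are assumptions made for this illustration, operations are modeled as Python callables on the object state, and the Reproposed(k) flag is omitted for brevity.

```python
def apply_decision(i, k, decided, x):
    """Apply Decision(k) at process i if the (k-1)-th update is in place.

    `decided` is a pair (up, owner) and `x` is the pair (O, last index).
    Returns (new_x, action); action is None when the step is not enabled.
    """
    obj, last = x
    if last != k - 1:
        return x, None  # the (k-1)-th update has not been applied yet
    up, owner = decided
    # The client of process i gets a report only for its own operation.
    action = "Report-update" if owner == i else "Update"
    return (up(obj), k), action


def must_repropose(i, proposed, decided):
    """True when process i proposed its own operation for this instance
    and a different operation was decided."""
    return proposed is not None and proposed[1] == i and proposed != decided


def inc(o):
    return o + 1


def dbl(o):
    return o * 2


x = (0, 0)
x, action = apply_decision(1, 1, (inc, 1), x)  # own operation: reported
x, action = apply_decision(1, 2, (dbl, 2), x)  # operation of client 2: silent
```

After the two steps the local copy is (2, 2); a call for instance 4 at this point would leave the state unchanged, since instance 3 has not been applied.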



1 Notice that we used the identifier j since process i may propose as its initial value the operation of 
another process j if it knows that process j is proposing that operation (see actions SendWantPaxos_i 
and RecWantPaxos_i). 
State variables. Propose is an array used to store the operations to propose as 
initial values in the instances of PAXOS. Decision is an array used to store the 
outcomes of the instances of PAXOS. The integer S is the index of the first undefined 
entry of the array Propose. This array is kept in such a way that it is always 
defined up to Propose(S-1) and is undefined from Propose(S) on. Variable X describes 
the current state of the object. Initially the object is in its initial state O_0. The 
DATAREPLICATION_i automaton keeps an updated copy of the object, together with 
the index of the last operation applied to the object. Initially the object is described 
by (O_0, 0). Let O_k be the state of the object after the application to O_0 of the first 
k operations. When variable X = (O, k), we have that O = O_k. When the outcome 
Decision(k+1) = (up, i) of the (k+1)-th instance of PAXOS is known and the current state 
of the object is (O, k), the operation up can be applied and process i can give back a 
response to the client that requested the operation. 

Variable CurRead is used to give back the report of a read. Variable Proposed(k) 
is a flag indicating whether or not an Init(k, v)_i action for the k-th instance has been 
executed, so that the Init(k, v)_i action is executed at most once (though executing 
this action multiple times does not affect PAXOS). Similarly Reproposed(k) is a flag 
used to re-propose only once an operation that has not been installed. Notice that 
an operation needs to be re-proposed only once for each instance it loses, because the 
re-proposed operation will itself be re-proposed again if it is not installed. 

8.3.2 Correctness and analysis 

We do not prove formally the correctness of the DATAREPLICATION algorithm. By 
correctness we mean that sequential consistency is never violated. Intuitively, the 
correctness of DATAREPLICATION follows from the correctness of MULTIPAXOS. Indeed 
all processes agree on each update operation to apply to the object: the outcomes of 
the various instances of PAXOS give the sequence of operations to apply to the object 
and each process has the same sequence of update operations. 
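
This intuition can be illustrated with a short Python sketch, under the assumption that update operations are deterministic functions of the object state. The function replay and the sample operations are hypothetical and serve only as an example.

```python
from functools import reduce


def replay(initial, updates):
    """Apply an agreed sequence of update operations to the initial state."""
    return reduce(lambda obj, up: up(obj), updates, initial)


# The sequence of operations given by the outcomes of instances 1, 2, 3.
agreed = [lambda o: o + 5, lambda o: o * 2, lambda o: o - 1]

replica_a = replay(0, agreed)
replica_b = replay(0, agreed)
print(replica_a == replica_b)
```

Two replicas that replay the same sequence from the same initial state reach the same state, while a replica applying the operations in a different order could diverge; this is why agreement on the order, instance by instance, is what matters.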
Theorem 8.3.1 Let α be an execution of the system consisting of DATAREPLICATION 
and MULTIPAXOS. Let β be the subsequence of α consisting of the request/report events 
and assume that β is complete. Then β is sequentially consistent. 

Proof sketch: To see that β is sequentially consistent it is sufficient to give an atomic 
request/report sequence γ such that γ|i = β|i, for each client i. The sequence γ can be 
easily constructed in the following way: let up_1, up_2, up_3, ... be the sequence of update 
operations agreed upon by all the processes, where up_k was requested by client i_k; 
let γ' be the request/report sequence 
Request-update(up_1)_{i_1}, Report-update_{i_1}, Request-update(up_2)_{i_2}, Report-update_{i_2}, ...; 
then γ is the sequence obtained from γ' by adding Request-read, Report-read events in 
the appropriate places (i.e., if client i requested a read when the status of the local 
copy was O_k, then place Request-read_i, Report-read(O_k)_i between Report-update_{i_k} 
and Request-update(up_{k+1})_{i_{k+1}}). ∎ 
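
The construction used in the proof sketch can be written down as a small Python function. This is only an illustration of the construction, not a proof: the function name build_atomic_sequence and the tuple encoding of events are assumptions, and the reads are assumed to be listed in the order in which each client issued them.

```python
def build_atomic_sequence(updates, reads):
    """Build an atomic request/report sequence from the agreed updates.

    updates: list of (operation name, client) in decided order, so that
             updates[k-1] is the outcome of instance k.
    reads:   list of (client, k), meaning that the client read the local
             copy when it reflected exactly the first k updates.
    """
    gamma = []
    for k in range(len(updates) + 1):
        if k >= 1:
            up, client = updates[k - 1]
            gamma.append(("Request-update", up, client))
            gamma.append(("Report-update", client))
        # Reads that observed state O_k go right after the k-th report.
        for client, seen in reads:
            if seen == k:
                gamma.append(("Request-read", client))
                gamma.append(("Report-read", f"O_{k}", client))
    return gamma


gamma = build_atomic_sequence([("a", 1), ("b", 2)], [(3, 1), (1, 0)])
```

Here the read of client 1, which observed the initial state, is placed before the first update, and the read of client 3 is placed between the reports of the first and second updates.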

Liveness is not guaranteed. It is possible that an operation is never satisfied 
because new operations keep being requested and satisfied. PAXOS guarantees 
validity, but any initial value can be the final output value; thus when an operation is 
re-proposed in subsequent instances, it is not guaranteed that it will eventually be the 
outcome of an instance of PAXOS if new operations are requested. A simple scenario is 
the following. Process 1 and process 2 receive requests for update operations up_1 and 
up_2, respectively. Instance 1 of PAXOS is run and operation up_2 proposed by process 
2 is installed. Thus process 1 re-proposes its operation in instance 2. Process 3 has, 
meanwhile, received a request for update operation up_3 and proposes it in instance 
2. The operation up_3 of process 3 is installed in instance 2. Again process 1 has to 
re-propose its operation in a new instance. Nothing guarantees that process 1 will 
eventually install its operation up_1 if other processes keep proposing new operations. 
This problem could be avoided by using some form of priority for the operations to 
be proposed by the leader in new instances of PAXOS. 
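
The starvation scenario can be reproduced with a toy Python simulation. It abstracts PAXOS away completely: each instance is modeled by a function choose that picks any one of the pending operations, which is all that validity allows us to assume. The names simulate and newest are hypothetical.

```python
def simulate(instances, choose):
    """Run `instances` successive instances with one new request per instance.

    `choose` models the outcome of an instance: it may return any of the
    pending (proposed but not yet installed) operations.
    """
    pending = ["up_1"]
    installed = []
    for k in range(1, instances + 1):
        pending.append(f"up_{k + 1}")  # a new request competes in instance k
        winner = choose(pending)
        pending.remove(winner)
        installed.append(winner)
    return installed, pending


def newest(ops):
    """An unlucky schedule: the most recent proposal always wins."""
    return ops[-1]


installed, pending = simulate(5, newest)
```

With this schedule up_2, ..., up_6 are installed in instances 1 to 5 while up_1 is still pending, and the same holds for any number of instances.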

The algorithm exhibits the same fault-tolerance properties as PAXOS: it tolerates process 
stopping and recovery, message loss, duplication and reordering, and timing failures. However, 
as in PAXOS, to get progress it is necessary that the system executes a long enough 
nice execution fragment. 
8.4 Concluding remarks 

The application of MULTIPAXOS to data replication that we have presented in this 
chapter is intended only to show how MULTIPAXOS can be used to implement a data 
replication algorithm. A better data replication algorithm based on MULTIPAXOS can 
certainly be designed. We have not provided a proof of correctness of this algorithm; 
also the performance analysis is not given. There is work to be done to obtain a good 
data replication algorithm. 

For example, it should be possible to achieve liveness by using some form of 
priority for the operations proposed in the various instances of PAXOS. The easiest 
approach would give priority to the operation that has been re-proposed the most 
times: if the leader can choose among several operations, it chooses the one that 
has been re-proposed most. This should guarantee that requested operations do not 
"starve" and are eventually satisfied. 
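
A minimal Python sketch of such a priority rule is given below. It is only one possible reading of the suggestion above and not a worked-out algorithm: the function names, the dictionary of re-proposal counts and the tie-breaking rule (the most recent proposal wins among equals, the least favorable choice for old operations) are all assumptions.

```python
def choose_most_reproposed(reproposals):
    """Pick the pending operation with the highest re-proposal count.

    Ties are broken in favor of the most recently added operation.
    """
    best = max(reproposals.values())
    return [op for op, count in reproposals.items() if count == best][-1]


def simulate_with_priority(instances):
    """One new request arrives per instance; losers are re-proposed."""
    reproposals = {"up_1": 0}  # pending operation -> times re-proposed
    installed = []
    for k in range(1, instances + 1):
        reproposals[f"up_{k + 1}"] = 0
        winner = choose_most_reproposed(reproposals)
        del reproposals[winner]
        installed.append(winner)
        for op in reproposals:
            reproposals[op] += 1  # every loser is re-proposed once more
    return installed, reproposals


installed, pending = simulate_with_priority(4)
```

Even with the unfavorable tie-break, up_1 loses instance 1 but wins instance 2, because by then it is the only operation that has been re-proposed.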

In this chapter we have only sketched how to use PAXOS to implement a data 
replication algorithm. We leave the development of a data replication algorithm 
based on PAXOS as future work. 
[Figure 8-3: Data replication. Diagram of the client interface of DATAREPLICATION_i, 
with actions Request-update(up)_i, Report-update(r)_i, Request-read_i and Report-read(O)_i.] 
Chapter 9 



Conclusions 



The consensus problem is a fundamental problem in distributed systems. It plays a 
key role in practical problems involving distributed transactions. In practice the 
components of a distributed system are subject to failures and recoveries, thus any practical 
algorithm should cope as much as possible with failures and recoveries. PAXOS is a 
highly fault-tolerant algorithm for reaching consensus in a partially synchronous 
distributed system. MULTIPAXOS is a variation of PAXOS useful when consensus has to 
be reached on a sequence of values. Both PAXOS and MULTIPAXOS were devised by 
Lamport [29]. 

The PAXOS algorithm combines high fault-tolerance with efficiency: safety is 
maintained despite process halting and recovery, message loss, duplication and reordering, 
and timing failures; also, when there are no failures or recoveries and a majority of 
processes are alive for a sufficiently long time, PAXOS reaches consensus using time 
and a number of messages that are linear in the number of processes. 

PAXOS uses the concept of a leader, i.e., a distinguished process that leads the 
computation. Unlike other algorithms whose correctness is jeopardized if there is not 
a unique leader, PAXOS is safe also when there are no leaders or more than one leader; 
however to get progress there must be a unique leader. This nice property allows us 
to use a sloppy leader elector algorithm that guarantees the existence of a unique 
leader only when no failures or process recoveries happen. This is really important 
in practice, since in the presence of failures it is practically not possible to provide a 
reliable leader elector (this is due to the difficulty of detecting failures). 

Consensus algorithms currently used in practice are based on the 2-phase commit 
algorithm (e.g., [2, 25, 41, 48], see also [22]) and sometimes on the 3-phase commit 
algorithm (e.g., [47, 48]). The 2-phase commit protocol is not at all fault tolerant. The 
reason why it is used in practice is that it is very easy to implement and the 
probability that failures affect the protocol is low. Indeed the time that elapses from the 
beginning of the protocol to its end is usually so short that the possibility of failures 
becomes irrelevant; in small networks, messages are delivered almost instantaneously 
so that a 2-phase commit takes a very short time to complete; however, the protocol 
blocks if failures do happen and recovery schemes need to be invoked. Protocols that 
are efficient when no failures happen yet highly fault tolerant are necessary when 
the possibility of failures grows significantly, as happens, for example, in distributed 
systems that span wide areas. The PAXOS algorithm satisfies both requirements. 

We believe that PAXOS is the most practical solution to the consensus problem 
currently available. 

In the original paper [29], the PAXOS algorithm is described as the result of discov- 
eries of archaeological studies of an ancient Greek civilization. That paper contains 
a sketch of a proof of correctness and a discussion of the performance analysis. The 
style used for the description of the algorithm often diverts the reader's attention. 
Because of this, we found the paper hard to understand and we suspect that others 
did as well. Indeed the PAXOS algorithm, even though it appears to be a practical 
and elegant algorithm, seems not widely known or understood, either by distributed 
systems researchers or distributed computing theory researchers. 

In this thesis we have provided a new presentation of the PAXOS algorithm, in 
terms of I/O automata; we have also provided a correctness proof and a time per- 
formance and fault-tolerance analysis. The correctness proof uses automaton com- 
position and invariant assertion methods. The time performance and fault-tolerance 
analysis is conditional on the stabilization of the system behavior starting from some 
point in an execution. Stabilization means that no failures or recoveries happen 
after the stabilization point and a majority of processes are alive for a sufficiently 
long time. 

We have also introduced a particular type of automaton model called the Clock 
GTA. The Clock GTA model is a particular type of the general timed automaton 
(GTA) model. The GTA model has formal mechanisms to represent the passage of 
time. The Clock GTA enhances those mechanisms to represent timing failures. We 
used the Clock GTA to provide a technique for practical time performance analysis 
based on the stabilization of the physical system. We have used this technique to 
analyze PAXOS. 

We also have described MULTIPAXOS and discussed an example of how to use 
MULTIPAXOS for data replication management. Another immediate application of 
PAXOS is to distributed commit. PAXOS bears some similarities with the 3-phase 
commit protocol; however, 3-phase commit needs in practice a reliable failure detector. 

Our presentation of PAXOS has targeted the clarity of presentation of the algo- 
rithm; a practical implementation does not need to be as modular as the one we 
have presented. For example, we have separated the leader behavior of a process 
into two parts, one that takes care of leading a round and another one that takes 
care of broadcasting a reached decision; this has resulted in the duplication of state 
information and actions. In a practical implementation it is not necessary to have 
such a separation. Also, a practical algorithm could use optimizations such as mes- 
sage retransmission, so that the loss of one message does not affect the algorithm, 
or waiting larger time-out intervals before abandoning a round, so that a little delay 
does not force the algorithm to start a new round. 

Further directions of research concern improvements of PAXOS. For example it is 
not clear whether a clever strategy for electing the leader can help in improving the 
overall performance of the algorithm. We used a simple leader election strategy which 
is easy to implement, but we do not know if more clever leader election strategies 
may positively affect the efficiency of PAXOS. Also, it would be interesting to provide 
performance analysis for the case when there are failures, in order to measure how 
badly the algorithm can perform. For this point, however, one should keep in mind 
that PAXOS does not guarantee termination in the presence of failures. We remark 
that when timing failures and process stopping failures are allowed, the problem is unsolvable. 
However, in some cases termination is achieved even in the presence of failures, e.g., when 
only a few messages are lost or a few processes stop. 

It would be interesting to compare the use of PAXOS for data replication with 
other related algorithms such as the data replication algorithm of Liskov and Oki. 
Their work seems to incorporate ideas similar to the ones used in PAXOS. 

Also the virtual synchrony group communication scheme of Fekete, Lynch and 
Shvartsman [16], based on previous work by Amir et al. [3], Keidar and Dolev [24] 
and Cristian and Schmuck [7], uses ideas somewhat similar to those used by PAXOS: 
quorums and timestamps (timestamps in PAXOS are basically the round numbers). 

Certainly a further step is a practical implementation of the PAXOS algorithm. We 
have shown that PAXOS is very efficient and fault tolerant in theory. While we are sure 
that PAXOS exhibits good performance from a theoretical point of view, we still need 
the support of a practical implementation and the comparison of the performance of 
such an implementation with existing consensus algorithms to affirm that PAXOS is 
the best currently available solution to the consensus problem in distributed systems. 

We recently learned that Lee and Thekkath [33] used PAXOS to replicate state 
information within their Petal system, which implements a distributed file server. In 
the Petal system several servers, each with several disks, cooperate to provide to the 
users a virtual, big and reliable storage unit. Virtual disks can be created and deleted. 
Servers and physical disks may be added or removed. The information stored on the 
physical disks is duplicated to some extent to cope with server and/or disk crashes, 
and load balancing is used to speed up the performance. Each server of the Petal 
system needs to have a consistent global view of the current system configuration; this 
important state information is replicated over all servers using the PAXOS algorithm. 
Appendix A 



Notation 



This appendix contains a list of symbols used in the thesis. Each symbol is listed 
with a brief description and a reference to the pages where it is defined. 

n            number of processes in the distributed system. (37) 

I            ordered set of n process identifiers. (37) 

ℓ            time bound on the execution of an enabled action. (37) 

d            time bound on the delivery of a message. (37) 

V            set of initial values. (47) 

R            set of round numbers. A round number is a pair (x, i), 
             where x ∈ N and i ∈ I. Round numbers are totally ordered. (65) 

Hleader(r)   history variable. The leader of round r. (80) 

Hvalue(r)    history variable. The value of round r. (80) 

Hfrom(r)     history variable. The round from which the value of round r is taken. (80) 

Hinfquo(r)   history variable. The info-quorum of round r. (80) 

Haccquo(r)   history variable. The accepting-quorum of round r. (80) 

Hreject(r)   history variable. Processes committed to reject round r. (80) 

R_S          set of round numbers of rounds for which Hleader is set. (82) 

R_V          set of round numbers of rounds for which Hvalue is set. (82) 

t_α^i        time of occurrence of Init(v)_i in α. (93) 

T\ max of U + 2n£ + 2d and f„ + 21 (82) 
S_CHA        distributed system consisting of CHANNEL_{i,j}, for i, j ∈ I. (42) 

S_DET        distributed system consisting of S_CHA and DETECTOR_i, for i ∈ I. (52) 

S_LEA        distributed system consisting of S_DET and LEADERELECTOR_i, 
             for i ∈ I. (56) 

S_BPX        distributed system consisting of S_CHA and BPLEADER_i, BPAGENT_i 
             and BPSUCCESS_i, for i ∈ I. (80) 

S_PAX        distributed system consisting of S_LEA and BPLEADER_i, BPAGENT_i, 
             BPSUCCESS_i and STARTERALG_i, for i ∈ I. (100) 



regular time-passage step, a time-passage step ν(t) that increases the local clock of 
each Clock GTA by t. (26) 

regular execution fragment, an execution fragment whose time-passage steps are all 
regular. (26) 

stable execution fragment, a regular execution fragment with no process crash or 
recoveries and no loss of messages. (38-42) 

nice execution fragment, a stable execution fragment with a majority of processes 
alive. (43) 

start of a round, is the execution of action NewRound for that round. (91) 

end of a round, is the execution of action RndSuccess for that round. (91) 

successful round, a round is successful when it ends, i.e., when action RndSuccess for 
that round is executed. (91) 